What Happens When Firefox Crashes?

Follow a Firefox crash from its genesis in a collapsing browser process through the dizzying array of collection, storage, and reporting systems that make up Socorro, our open-source crash collector. Enjoy war stories of weird, interlocking failures, and see how we nevertheless continue to fulfill our mandate: “Never lose a crash.” Observe some patterns that emerged from this system which can be useful in yours.

What Happens When Firefox Crashes? Document Transcript

  • 1-2. What Happens When Firefox Crashes? (or: It's Not My Fault Tolerance), by Erik Rose
    - Welcome! I write server-side code at Mozilla, and I'm here to tell you about the Big Data systems behind Firefox crash reporting.
    - A browser is a complex piece of software, and it's challenging to test. It interacts with a lot of other software: JS add-ons, compiled plugins, OSes, different hardware. Even the unique timing of your setup can trigger bugs.
    - There are also 50 billion to 1 trillion web pages out there, doing unpredictable, creative things, any of which could make Firefox explode.
    - That's why, in addition to an extensive test suite and manual testing, we invest a lot in crash reporting. So today I want to show you what happens when Firefox crashes and what the systems that receive and process the crash reports look like.
  • 3. If you've crashed Firefox, you've seen this dialog. If you choose to send us a crash report, we use it to find new bugs and to decide where to concentrate our time.
  • 4-5. Socorro (https://github.com/mozilla/socorro)
    - The thing that receives Firefox crash reports is called Socorro. It's open source; you can use it if you want, and it's very flexible. It's used by Valve and Yandex.
    - Socorro gets its name from the Very Large Array in Socorro, NM, because…
  • 6. Very Large Array, Socorro, New Mexico
    - Like that array, Socorro receives signals from out in the universe and tries to filter patterns out of the noise.
    - The VLA is 27 dish antennas that can move to follow objects across the sky; Socorro is a very large array of slightly less expensive systems that tracks crashes across the user base.
  • 7. The Big Picture
    - Let's take a peek behind the curtain. You'll recognize some things you're doing yourself, and some other things might surprise you. So let's embark on our tour of Socorro!
  • 8. On its front end, it looks like this. It's public; we don't hide our failures, which is unusual.
  • 9-10. You can drill into this to see, for example, the top crashers:
    - % of all crashes
    - signature (stack trace)
    - breakdown by platform
    - ticket correlations
  • 11. Another example: explosive crashes.
    - The music charts call these "bullets": a song that rises quickly up the charts to suddenly become extremely popular.
    - Here it's a crash we expect to see as 5% of all crashes, but then you wake up one morning and it's 85% of all crashes.
    - Generally what this means is that one of the major sites shipped a new piece of JS which crashes us.
    - The most recent example was during the last Olympics, when Google released a new Doodle every day.
  • 12. I think it was this one that crashed us. On the one hand, we knew the problem was going away tomorrow, so that's nice. On the other hand, a lot of people have Google set as their startup page, so that's bad. ;-)
  • 13. You can also find:
    - the most common crashes for a version, platform, etc.
    - new crashes
    - correlations, to ferret out interactions between plugins, for example
    Pretty straightforward, right? The backend is less straightforward…
  • 14-20. [Architecture diagram, built up across these slides: Crash Reporter + Breakpad → Zeus load balancer → Collectors → Local FS → Crash Movers → HBase and RabbitMQ → Processors → PostgreSQL and elasticsearch → pgbouncer/memcached → Middleware → Web Front-end; plus Debug symbols on NFS, LDAP, Bugzilla Associator, Automatic Emailer, Bugzilla, Materialized View Builders (Active Daily Users, Signatures, Versions, Explosiveness), ADU Count Loader, Version Scraper, FTP, Vertica, Duplicate Finder, and cron jobs.]
    - Over 120 boxes, all physical. Why physical? Organizational momentum, and HBase doesn't do so well virtualized: it's very talky between nodes, so low latency is important.
    - How much data? "The smallest big-data project." It used to be considered big; not anymore.
    - Numbers: 500M Firefox users; 150M active daily users, probably more; 3,000 crashes per minute, 3M per day; a Firefox crash is 150KB-20MB (a hard ceiling: anything over 20MB is just an out-of-memory crash full of corrupt garbage anyway); 800GB in PostgreSQL; 110TB in HDFS (that's replicated; 40TB of actual data).
    - Dictum: "Never lose a crash." We have all Firefox crashes from the very beginning. One reason for this is so a developer can go into the UI and request that a crash be processed, and it will be.
  • 21-22. [Same diagram, zoomed to the bottom.] It all starts down here, with Firefox. But even that's made up of multiple moving parts.
  • 23-27. [Client-side slice of the diagram: Breakpad → Crash Reporter → Zeus load balancer → Collectors.]
    - These first three pieces are all on the client side; the first two are in the Firefox process.
    - Breakpad: used by Firefox, Chrome, Google Earth, Camino, Picasa. It takes a stack dump of all threads (opaque; it doesn't even know the frame boundaries) plus a little other processor state, then throws it to another process: the Crash Reporter. Why another process? Remember, Firefox has crashed; its state is unknown.
    - The Crash Reporter is responsible for that little dialog. Binary crash dump + JSON metadata → POST → collectors…
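
A minimal sketch of what that POST could look like, assuming an illustrative collector URL and annotation field names (the real Crash Reporter is compiled into Firefox, and its exact field set differs):

```python
# Sketch only: endpoint URL and metadata field names are illustrative.
import json
import requests

def submit_crash(minidump_path, metadata):
    """POST the opaque Breakpad minidump plus JSON-ish metadata as multipart/form-data."""
    with open(minidump_path, "rb") as dump:
        files = {
            # The binary stack dump of all threads, straight from Breakpad.
            "upload_file_minidump": ("crash.dmp", dump, "application/octet-stream"),
        }
        fields = {
            # A little extra context the Crash Reporter knows about.
            "ProductName": metadata["product"],
            "Version": metadata["version"],
            "CrashTime": str(metadata["crash_time"]),
            "Extra": json.dumps(metadata.get("extra", {})),
        }
        resp = requests.post("https://crash-reports.example.com/submit",
                             data=fields, files=files, timeout=30)
    resp.raise_for_status()
    return resp.text   # the collector replies with an ID you can look up later

# crash_id = submit_crash("/tmp/crash.dmp",
#                         {"product": "Firefox", "version": "23.0", "crash_time": 1374098066})
```
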
  • 28. …which is where the crash really enters Socorro.
  • 29. Collectors are super simple: they write crashes to local disk…
  • 30. Then another process on the same box…
  • 31. …the Crash Movers, picks up crashes off the local disk and sends them to two places.
  • 32. First: into HBase. HBase is the primary store for crashes, running on 70 nodes. At the same time…
  • 33. …the crash IDs go to RabbitMQ, into "soft realtime" (priority) and normal queues; priority crashes are processed within 60 seconds.
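
A minimal sketch of that hand-off, assuming crashes sit on local disk as <crash_id>.dump / <crash_id>.json pairs; the queue name is illustrative, and the HBase write is stubbed out so the sketch stays client-library-agnostic:

```python
# Sketch only: spool layout, queue name, and the HBase stub are illustrative.
import json
import pathlib
import pika  # RabbitMQ client

SPOOL = pathlib.Path("/var/spool/socorro")

def store_in_hbase(crash_id, raw_dump, metadata):
    """Stand-in for the HBase put that becomes the crash's permanent home."""
    print("stored", crash_id, len(raw_dump), "bytes")

def move_pending_crashes(channel):
    for meta_path in SPOOL.glob("*.json"):
        crash_id = meta_path.stem
        dump_path = meta_path.with_suffix(".dump")
        metadata = json.loads(meta_path.read_text())
        store_in_hbase(crash_id, dump_path.read_bytes(), metadata)   # 1st: primary store
        channel.basic_publish(                                       # 2nd: just the ID, for the processors
            exchange="",
            routing_key="socorro.normal",
            body=crash_id,
            properties=pika.BasicProperties(delivery_mode=2),        # survive a broker restart
        )
        dump_path.unlink()
        meta_path.unlink()

connection = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq.example.com"))
channel = connection.channel()
channel.queue_declare(queue="socorro.normal", durable=True)
move_pending_crashes(channel)
```
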
  • 34. Processors: this is where the real action happens.
    - To process a crash means to do what's necessary to make it visible in the web UI.
    - A processor takes an ID from RabbitMQ, turns the binary dump into debug-symbol stack traces, generates a signature, then puts the crash into buckets and adds it to PostgreSQL and elasticsearch. First, PostgreSQL.
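
A minimal sketch of a processor's loop, with illustrative helper stubs and queue name; the real processor is a configurable pipeline of transform rules, but the shape is the same:

```python
# Sketch only: the helpers are stand-ins; queue name and paths are illustrative.
import subprocess
import pika

def fetch_raw_crash(crash_id):
    """Stand-in for pulling the raw dump and metadata back out of HBase."""
    return f"/var/spool/socorro/{crash_id}.dump", {}

def save_processed_crash(processed):
    """Stand-in for writing the processed crash to PostgreSQL and elasticsearch."""
    print("processed", processed["crash_id"], processed["signature"])

def symbolicate(dump_path, symbols_dir):
    """Turn the opaque binary dump into readable frames with Breakpad's
    minidump_stackwalk and the debug symbols kept on NFS."""
    out = subprocess.run(["minidump_stackwalk", dump_path, symbols_dir],
                         capture_output=True, text=True, check=True)
    return out.stdout.splitlines()

def generate_signature(frames):
    """Bucket the crash; a simplistic stand-in keyed on the top frame."""
    return frames[0] if frames else "EMPTY: no frames"

def on_crash(channel, method, properties, body):
    crash_id = body.decode()
    dump_path, metadata = fetch_raw_crash(crash_id)
    frames = symbolicate(dump_path, "/mnt/symbols")
    save_processed_crash({"crash_id": crash_id,
                          "signature": generate_signature(frames),
                          "frames": frames,
                          "metadata": metadata})
    channel.basic_ack(delivery_tag=method.delivery_tag)   # ack only once it's safely stored

connection = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq.example.com"))
channel = connection.channel()
channel.basic_consume(queue="socorro.normal", on_message_callback=on_crash)
channel.start_consuming()
```
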
  • 35. PostgreSQL
    - Our main interactive datastore: it's what the web app and most batch jobs talk to.
    - It stores unique crash signatures, numbers of crashes bucketed by signature, and other aggregations of crash counts on various facets, to make reporting fast.
    - It's in there for a couple of reasons: prompt, reliable answers to queries, and referential integrity. It stores the unique crash signatures and their relationships to versions, tickets, and so on, and it's easy to query from PHP and Django.
    Now let's turn around and talk about elasticsearch, which operates in parallel.
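
A minimal sketch of the kind of reporting query this enables, assuming an illustrative pre-aggregated table signature_counts(signature, product, version, report_date, crash_count); the real schema has many more materialized views:

```python
# Sketch only: table and column names are illustrative.
import psycopg2

TOP_CRASHERS = """
    SELECT signature,
           SUM(crash_count) AS crashes,
           ROUND(100.0 * SUM(crash_count) / SUM(SUM(crash_count)) OVER (), 2) AS pct
    FROM signature_counts
    WHERE product = %s AND version = %s
      AND report_date >= CURRENT_DATE - %s
    GROUP BY signature
    ORDER BY crashes DESC
    LIMIT 20;
"""

with psycopg2.connect("dbname=breakpad") as conn, conn.cursor() as cur:
    cur.execute(TOP_CRASHERS, ("Firefox", "23.0", 7))
    for signature, crashes, pct in cur.fetchall():
        print(f"{pct:5}%  {crashes:8}  {signature}")
```
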
  • 36. elasticsearch
    - A 90-day rolling window of crashes, used for faceting. The new kid on the block.
    - Extremely flexible text analysis: though geared toward natural language, we may be able to persuade it to take apart C++ call signatures and let us mine those in meaningful ways.
    - It may someday eat some of HBase's or Postgres's lunch. It scales out like HBase and can even execute arbitrary scripts near the data, collating and returning data through a master node. Maybe not the flexibility of full map-reduce, but it has filter caching, and it maintains its indices itself.
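
A minimal sketch of that faceting, assuming an illustrative index pattern and field names; queries stay inside the 90-day rolling window by design:

```python
# Sketch only: index pattern, field names, and host are illustrative.
import requests

query = {
    "query": {"range": {"date_processed": {"gte": "now-90d"}}},
    "size": 0,                          # we only want the buckets, not the documents
    "aggs": {
        "by_signature": {
            "terms": {"field": "signature", "size": 20},
            "aggs": {"by_platform": {"terms": {"field": "platform"}}},
        }
    },
}

resp = requests.post("http://elasticsearch.example.com:9200/crashes-*/_search",
                     json=query, timeout=30)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["by_signature"]["buckets"]:
    platforms = {b["key"]: b["doc_count"] for b in bucket["by_platform"]["buckets"]}
    print(bucket["key"], bucket["doc_count"], platforms)
```
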
  • 37. Web services ("middleware")
    - At the end of this story is the web application, but between it and the data sits a REST middleware layer.
    - Why? The front end was in PHP and we didn't want to reimplement model logic in two languages; we change datastores; we move data around.
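
A minimal sketch of the idea, not the real Socorro API: the front end asks one stable HTTP service, and the middleware decides which datastore answers:

```python
# Sketch only: route, parameters, and the model stub are illustrative.
from flask import Flask, jsonify, request

app = Flask(__name__)

def top_crashers_from_postgres(product, days):
    """Stand-in for the model layer; it could be swapped for elasticsearch
    tomorrow without the front end noticing."""
    return []

@app.route("/report/top_crashers/")
def top_crashers():
    product = request.args.get("product", "Firefox")
    days = int(request.args.get("days", 7))
    return jsonify({"product": product, "days": days,
                    "hits": top_crashers_from_postgres(product, days)})

if __name__ == "__main__":
    app.run(port=8883)
```
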
  • 38. Web app: Django; each front-end box runs memcached.
  • 39. [Full architecture diagram again.] And that concludes our big-picture tour of Socorro! Now, as the years have gone by and the system has grown in scope and size, interesting patterns emerged.
  • 40. Big Patterns
    - Tooling was clearly missing; standard practices weren't good enough. I'm going to call out some of these emergent needs and show you our solutions. Maybe you'll even find some of our tools useful. The first…
  • 41. Big Storage
    - Every Big Data system has to put everything somewhere. The solutions are well established, and the amount of data you can deal with in a commoditized fashion rises every year, but sharding and replication are expensive. We realized that, by an application of statistics, we could shrink the amount of data itself.
  • 42-45. Big Storage: Sampling, Targeting, Rarification
    - Sampling: per product. We keep all FirefoxOS crashes, for example.
    - We don't want to lose interesting rare events to sampling, so we also do targeting: take anything with a comment, for instance. Our statisticians have told us all kinds of useful things about the shape of our data; for instance, the rules that select interesting events don't throw off our OS or version statistics.
    - Rarification: throw away uninteresting parts of stack frames. Skiplist rules get uninteresting parts of the stack out of the data, to reduce noise. There are two kinds: sentinel frames to jump TO, and frames that should be ignored. This is an important part of making our hash buckets wider, reducing the number of unique crash signatures. (A sketch of this follows below.)
    - With these three techniques, we cut down the amount of data we have to handle in the later stages of our pipeline. Sure, we still have to keep everything in HBase, but we don't run live queries against that, so it just means buying more hard drives. The processors, RabbitMQ, PostgreSQL, elasticsearch, memcached, and the cron jobs all get a lighter load.
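
A minimal sketch of those two kinds of skiplist rules at work during signature generation; the regular expressions here are made up for illustration, not Socorro's real lists:

```python
# Sketch only: the patterns are illustrative, not Mozilla's actual skiplists.
import re

# Frames that never say anything interesting about the bug: drop them.
IRRELEVANT = [re.compile(p) for p in (
    r"^__memcpy",
    r"^malloc",
    r"^RaiseException",
)]

# Sentinel frames: if one appears anywhere in the stack, start the signature there.
SENTINELS = [re.compile(p) for p in (
    r"^mozilla::ipc::",
    r"^JS_.*",
)]

def generate_signature(frames, max_frames=5):
    """Collapse a symbolicated stack into a short, widely shared signature."""
    # Jump to the first sentinel frame, if any.
    for i, frame in enumerate(frames):
        if any(s.match(frame) for s in SENTINELS):
            frames = frames[i:]
            break
    # Then drop the frames that are pure noise.
    kept = [f for f in frames if not any(p.match(f) for p in IRRELEVANT)]
    return " | ".join(kept[:max_frames]) or "EMPTY: no usable frames"

print(generate_signature([
    "RaiseException", "malloc", "JS_CallFunctionValue", "js::RunScript", "main",
]))
# -> "JS_CallFunctionValue | js::RunScript | main"
```
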
  • 46. Big Systems
    - Big Data systems tend to be complicated systems, with diverse parts: it's not just one big 500-node HBase cluster and you're done.
    - Example: we have six data stores: the local filesystem, PostgreSQL, elasticsearch, HBase, memcached, and RabbitMQ.
    - This is typical of architectures now. Gone are the days of one datastore, one representation. Eighteen months ago I was hearing jokes about the "data mullet": relational in the front, NoSQL in the back. Now it's data dreadlocks; it's all over the place.
    - The kinds of problems you can have in these systems are really tough to track down.
  • 47-50. Hadoops! A tale of Big Failure: complex interactions; hardware matters; design for failure.
    - We were crashing every 50 hours: Hadoop's cleverness with TCP connections, plus TCP stack bugs in Linux and lying NICs, meant the OS buffers filled up with unclosed connections and crashed.
    - So we're very, very cautious about the equipment we use. Remember that hardware is a nontrivial part of your system.
    - When you have a problem like this, it can be hard to work out exactly what's gone wrong, and it can take time to get everybody together. Meanwhile, we must keep receiving crashes. [Boxes & springs]
  • 51-52. [Full architecture diagram.] The most important piece: this Local FS.
  • 53. [Diagram: the collection side, from Crash Reporter through the Local FS to HBase.]
    - Everything else can fail: the Local FS gives us three days of runway, and it has saved us several times.
    - Yours may not look like this, but you could imagine a system being able to serve just out of cache if the datastore went away, or operate in read-only mode if writes became unavailable. [SUMO]
    - One thing from this diagram we haven't talked about much yet is the cron jobs.
  • 54. Big Batching
    - Mozilla is a large project with a long legacy, and Socorro interfaces with a lot of other systems. A lot of this occurs via batch jobs.
  • 55. [Full architecture diagram, cron jobs highlighted.]
  • 56. [Batch-job slice of the diagram: Duplicate Finder, Bugzilla Associator, Automatic Emailer, Materialized View Builders (Active Daily Users, Signatures, Versions, Explosiveness), ADU Count Loader, Version Scraper.]
    - Materialized views; the version scraper, once a day; Bugzilla.
    - The Automatic Emailer sends advice back to users, like in the case where we see they have malware.
    - ADUs (active daily users) are the denominator for every metric, and loading them fails a lot: the metrics systems are unreliable, and everything that depends on them fails too.
  • 57. In fact, you can look at a lot of our periodic tasks as a dependency tree. One thing upstream fails…
  • 58-60. …and everything downstream of it fails too.
    - We replaced cron with crontabber. Instead of blindly running jobs whose prerequisites aren't fulfilled, it runs the parent until it succeeds, then runs the children. (A sketch of the idea follows below.)
    - We wanted diagrams to visualize the state of the system; doing that by hand was too error-prone. Then we thought: why not have crontabber draw them for us?
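
A minimal sketch of the idea (not crontabber's actual API): each job declares its prerequisites, and a job runs only once all of them have succeeded, so one upstream failure skips the subtree instead of producing garbage downstream. Job names and bodies are illustrative:

```python
# Sketch only: job names and bodies are placeholders.
def load_adu_counts():          print("loading ADU counts from metrics")
def build_signature_matview():  print("rebuilding signature materialized view")
def compute_explosiveness():    print("computing explosiveness")

JOBS = {
    # name: (dependencies, callable)
    "adu-count-loader":  ((), load_adu_counts),
    "signature-matview": (("adu-count-loader",), build_signature_matview),
    "explosiveness":     (("signature-matview",), compute_explosiveness),
}

def run_cycle(jobs):
    succeeded, failed = set(), set()
    remaining = dict(jobs)
    while remaining:
        runnable = [name for name, (deps, _) in remaining.items()
                    if all(d in succeeded for d in deps)]
        if not runnable:
            break                                  # everything left is blocked by a failure
        for name in runnable:
            deps, func = remaining.pop(name)
            try:
                func()
                succeeded.add(name)
            except Exception:
                failed.add(name)                   # its dependents never become runnable
    return succeeded, failed, set(remaining)       # the third set is the skipped jobs

print(run_cycle(JOBS))
```
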
  • 61. SVGs are really neat; nodes can wiggle if their state is unclear. And then we break the specifics down into a table…
  • 62. One job at a time at the moment, because "eek, matview performance", but a great contribution would be some kind of shared locks or thresholds so multiple jobs could run at once. But you know, right now it's good enough…
  • 63. Big Deal
    - And it's surprising how often that happens. Oftentimes, your makeshift solutions end up being good enough to do the job.
  • 64-67. [Architecture diagram, highlighting the queue and the Local FS buffer.]
    - A slapdash, hacky queue (in PostgreSQL): one job polls HBase and writes to PG; another polls PG and feeds the processors.
    - The Local FS buffer was a temporary fix when we had reliability problems with HBase.
    - I could tell you "don't be afraid of temporary hacks", but I think that's a healthy fear to have. Perhaps my message should be: do a good job on your temporary solutions, because they'll probably be around a while.
  • 68. The definition of "big" (hooks up to one computer, or fits on one desk) changes every year. The fact of … wearing nearly 100GB would have been unimaginable to the operator of a punch card duplicator from only 50 years ago. But the patterns that come out of large systems remain. Duplicate cards: why? To facet two ways in parallel. While you may need to generalize a bit, I have no doubt the techniques you learn today and tomorrow will serve you well into the future.
  • 69. Big Thanks twitter: ErikRose www.grinchcentral.com erik@mozilla.com