Follow a Firefox crash from its genesis in a collapsing browser process through the dizzying array of collection, storage, and reporting systems that make up Socorro, our open-source crash collector. Enjoy war stories of weird, interlocking failures, and see how we nevertheless continue to fulfill our mandate: “Never lose a crash.” Observe some patterns that emerged from this system which can be useful in yours.
Packaging is the Worst Way to Distribute Software, Except for Everything Else (mckern)
As part of the 2014 USENIX Release Engineering Summit West, I presented a talk about packaging software and what's wrong with current trends.
Here's the abstract:
Reliably distributing software is a notoriously difficult problem, and almost every operating system and programming language vendor has tried to solve it. This has led to a herd of packaging systems, almost none of which are cross-compatible; some manage system-level software, while others focus on extending their own language (often by trampling on system-level software). And like all competing standards, every packaging system comes with its own sharp corners, dull edges, and hidden idiosyncrasies to deal with along the path to packaging happiness. In an attempt to answer the question "How do I install this software and ensure that its dependencies are fulfilled?", some novel solutions have begun to see popular adoption. But a lot of these newer tools and techniques tread the same ground as their predecessors while overlooking the lessons that were learned along the way.
I'll talk about the state of native packaging systems on some popular platforms (Debian/Ubuntu, RHEL/CentOS/Fedora, and Mac OS X), packaging systems for popular languages (Ruby, Python, Perl, and Node) and the ways that developers are attempting to work around the limitations of these systems. I'll review the reasons that tools like curlbash, FPM, and omnibus packages have become popular by sharing lessons I've learned while working through these systems. While this will be an amusing presentation, I'll show how native packages can address the concerns that have pushed Release Engineers and Developers away. I will also talk about what native packaging systems can learn from the next generation of packaging tools.
The original abstract is available here:
https://www.usenix.org/conference/ures14west/summit-program/presentation/mckern
Puppet Camp LA 2015 talk covering: packages, package managers, puppet, and tips, tricks, and puppet modules for setting up secure package repositories.
API design is one of the most difficult areas of programming. Besides solving your immediate problem, you must also accommodate unknown future ones—and fit nicely into other people's brains. Let's explore how to do this without a time machine, considering compactness, orthogonality, consistency, safety, coupling, state handling, and the messy interface with human cognition, all illustrated with practical examples—and gruesome mistakes—from several popular Python libraries.
OSDC 2016 - Ingesting Logs with Style by Pere Urbon-Bayes (NETWAYS)
The log shipping scene has been with us for a long time: from syslog and rsyslog to today's Fluentd, Flume, and Logstash. Logstash has been pushing hard to introduce new features that make the experience better for everyone. At the end of the day, a healthy shipper means a happy sysadmin. The latest Logstash includes persistence to reduce the chance of data loss, monitoring to find out how everything is going, and configuration management to make your life a lot easier. But wait, there's more! Offline support, improved shutdown semantics, and other features that will get your logs shipped and leave you a rested sysadmin.
In this talk we'll see these features in action through a real live sensor monitoring example. By the end of the session, you will be able to use the full power of Logstash in your own deployments.
Symfony Live NYC 2014 - Rock Solid Deployment of Symfony Apps (Pablo Godel)
Web applications are becoming increasingly more complex, so deployment is not just transferring files with FTP anymore. We will go over the different challenges and how to deploy our PHP applications effectively, safely and consistently with the latest tools and techniques. We will also look at tools that complement deployment with management, configuration and monitoring.
SymfonyCon Madrid 2014 - Rock Solid Deployment of Symfony Apps (Pablo Godel)
Web applications are becoming increasingly more complex, so deployment is not just transferring files with FTP anymore. We will go over the different challenges and how to deploy our PHP applications effectively, safely and consistently with the latest tools and techniques. We will also look at tools that complement deployment with management, configuration and monitoring.
Desert Code Camp 2014: C#, the best programming language (James Montemagno)
Throughout the years many programming languages have come and gone, but C# is here to stay. It is everywhere and can run on over 2.5 Billion devices including desktop, web, servers, mobile devices, and game consoles! Come learn why I love C# so much and all of the amazing features it has to offer. This session will be action packed with so much live coding you will not know what to do!
My talk at Hack in the Box 2010 - Kuala Lumpur
It has been a decade since I started talking about computer security. 10 years have witnessed a change in threat landscapes, attack targets, exploits, techniques and damage. Two eco-systems are slowly and surely converging into one. On one hand, we have the application layer. Much has been talked about it. There is a steady trickling flow of XSS, XSRF, SQL injection and the usual suspects. Some of them are under the guise of "Web 2.0", and some of them are as ancient as CGI attacks of 1999. On the other hand, we have the desktop. Dominating the desktop is the browser, with its horde of assistants. Exploitation in this space has accelerated in the last 3 years.
How will the threat landscape change with the advent of new technologies and services? New standards are emerging, and the darling child of the web is HTML 5. A closer look at standards reveals an awful mess. Are the standards mitigating any security concerns? More importantly, will browser vendors and web application developers really respect the standards? The browser wars taught us that "might is right". If everyone breaks the web, that becomes a new adopted standard. New technologies, coupled with popular online services, make for some very interesting exploit delivery techniques.
This talk explores some innovative exploit delivery techniques that are born as a result of bloated standards and services designed without much thought towards security. We cover techniques where exploits can be delivered through URL shorteners and images. We take a look at some browser exploits. This talk ends with a discussion on exploit sophistication, ranging from highly polished and elegant techniques such as Return Oriented Programming to the downright crude and ugly techniques such as DLL Hijacking. How will we combine all this together? And will Anti-Virus still save us all?
Pivotal Open Source: Using Fluentd to gain insights into your logs (Kiyoto Tamura)
Logs: whether you are running simple KPI reports or tuning parameters for your machine learning algorithms, you need them. Many organizations realize this and build a logging infrastructure...kind of poorly.
In this talk, we will give an overview of what good logging infrastructure looks like and what the key ingredients are, and demonstrate how Fluentd, an open source data collector, can be used to build a unified, robust logging layer.
Programming is hard, but we can magnify our efforts with excellent API design. Let’s explore how, as we consider compactness, orthogonality, consistency, safety, coupling, state handling, layering, and more, illustrated with practical examples (and gruesome mistakes!) from several popular Python libraries.
WebLion Hosting: Leveraging Laziness, Impatience, and Hubris (Erik Rose)
Behind the scenes of WebLion's Plone hosting service, which uses Debian packages and a custom repository to deliver reliable, unattended updates to a cluster of heterogeneous departmental virtual servers. And it's all available for your own use for free.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality (Inflectra)
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Accelerate your Kubernetes clusters with Varnish Caching (Thijs Feryn)
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
UiPath Test Automation using UiPath Test Suite series, part 3 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
- UI automation introduction
- UI automation sample
- Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... (Ramesh Iyer)
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes real work: vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Connector Corner: Automate dynamic content and events by pushing a button (DianaGray10)
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Smart TV Buyer Insights Survey 2024 by 91mobiles (91mobiles)
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Generating a custom Ruby SDK for your web service or Rails API using Smithy (g2nightmarescribd)
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
What happens when Firefox crashes?
1. What Happens When Firefox Crashes? (or: It’s Not My Fault Tolerance) by Erik Rose
- Welcome! [Erik Rose, if not introduced]
- I write server-side code @ Mozilla.
- I'm here to tell you about the Big Data systems behind FF crash reporting.
- A browser is a complex piece of software. Challenging to test it.
- Interacts with a lot of other software: JS add-ons, compiled plugins, OSes, different hardware.
  - Even unique timings of your setup can trigger bugs.
  - Also, 50 billion – 1 trillion web pages. They do unpredictable, creative things.
  - ***Any of which could make FF explode***
- That's why, in addition to an extensive test suite and manual testing, we invest a lot in crash reporting.
- So today, I want to show you what happens when Firefox crashes and what the systems look like that receive and process the crash reports.
3. - If you've crashed FF, you've seen this dialog.
- If you choose to send us a crash report, we use it to…
  - find new bugs
  - decide where to concentrate our time
4. Socorro
- The thing that receives FF crash reports is called Socorro.
- ***Open source.*** You can use it if you want. Very flexible.
- Used by Valve, Yandex.
- Socorro gets its name from the Very Large Array in Socorro, NM because…
5. Socorro: https://github.com/mozilla/socorro
6. Very Large Array, Socorro, New Mexico
- Like that array, it receives signals from out in the universe and tries to filter out patterns from the noise.
- 27 dish antennas, which can move to follow objects across the sky.
- Socorro is a Very Large Array of slightly less expensive systems which tracks crashes across the userbase.
7. The Big Picture
- Let's take a peek behind the curtain.
- You'll recognize some things you're doing yourself, and some other things might surprise you.
- So let's embark on our tour of Socorro!
8. - On its front end, it looks like this.
- Public. We don't hide our failures. Unusual.
9. You can drill into this, to see e.g. top crashers:
- ***% of all crashes***
- signature (stack trace)
- breakdown by platform
- ticket correlations
11. - Another example: explosive crashes.
- Music charts have "bullets": a song which rises quickly up the charts to suddenly become extremely popular.
- Something we expect to see as 5% of all crashes, but then you wake up one morning, and they're 85% of all crashes.
- Generally what this means is that one of the major sites shipped a new piece of JS which crashes us.
- The most recent example of this is during the last Olympics, when Google released a new Doodle every day.
12. - I think it was this one that crashed us.
- On the one hand, we knew the problem was going away tomorrow. So that's nice.
- OTOH, a lot of people have Google set as their startup page. So that's bad. ;-)
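The "explosive crash" idea above (a signature's usual 5% share of crashes jumping to 85% overnight) boils down to comparing today's per-signature share against a trailing baseline. Here is a minimal illustrative sketch of that heuristic; the function name, thresholds, and data shapes are made up for this example and are not Socorro's actual materialized-view logic:

```python
from collections import Counter

def explosive_signatures(history, today, min_share=0.05, spike_factor=5.0):
    """Flag crash signatures whose share of all crashes suddenly spikes.

    history: list of Counters, one per prior day, mapping signature -> crash count.
    today:   Counter of crash counts for the current day.
    """
    # Baseline: each signature's average share of crashes over the trailing window.
    totals = Counter()
    for day in history:
        day_total = sum(day.values()) or 1
        for sig, n in day.items():
            totals[sig] += n / day_total
    baseline = {sig: share / len(history) for sig, share in totals.items()}

    today_total = sum(today.values()) or 1
    explosive = []
    for sig, n in today.items():
        share = n / today_total
        base = baseline.get(sig, 1 / today_total)  # unseen signatures get a tiny prior
        # A signature is "explosive" if it is both non-trivial today
        # and many times above its historical share.
        if share >= min_share and share / base >= spike_factor:
            explosive.append((sig, share, base))
    return sorted(explosive, key=lambda t: -t[1])
```

With three days of history where a signature sits at 5% and a current day where it hits 85%, only that signature is flagged, mirroring the scenario in the talk.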
13. You can also find…
- Most common crashes for a version, platform, etc.
- New crashes.
- Correlations: ferret out interactions between plugins, for example.
- Pretty straightforward, right? The backend is less straightforward…
14. [Architecture diagram] Crash Reporter (Breakpad) → Zeus load balancers → Collectors (spooling to local FS) → Crash Movers → HBase → RabbitMQ → Processors (debug symbols on NFS) → PostgreSQL (via pgbouncer) and elasticsearch → Middleware → Zeus → Web Front-end (memcached, LDAP). Cron jobs: Duplicate Finder, Bugzilla Associator, Automatic Emailer (Bugzilla), Materialized View Builders (Active Daily Users, Signatures, Versions, Explosiveness), ADU Count Loader (Vertica), Version Scraper (FTP).
- Over 120 boxes, all physical.
- Why physical?
  - Organizational momentum.
  - HBase doesn't do so well virtualized. It's very talky between nodes, so low latency is important.
- How much data? "The smallest big-data project." Used to be considered big. Not anymore.
- Numbers:
  - ***500M FF users***
  - ***150M ADUs. Probably more.***
  - ***3000 crashes/minute.*** 3M/day.
  - ***A FF crash*** is 150K-20MB (hard ceiling—anything over 20MB is just an out-of-mem crash anyway and just full of corrupt garbage).
  - ***800GB*** in PG.
  - ***110TB*** in HDFS. That's replicated. 40TB actual data.
- Dictum: “Never lose a crash.” We have all Firefox crashes from the very beginning.
  - One reason for this is so a developer can go into the UI and request a crash be processed, and it will be.
15. Duplicate
Finder
Zeus Zeus
Collectors
Local FS
Crash Movers
HBase
RabbitMQ Processors
PostgreSQL
elasticsearch
Web Front-end
memcached
Debug
symbols on
NFS
pgbouncer
LDAP
Middleware
Zeus Zeus
Bugzilla
Associator
Automatic
Emailer
Bugzilla
Materialized
View
Builders
Active Daily Users
Signatures
Versions
Explosiveness
ADU Count
Loader
Version
Scraper
FTP Vertica
Zeus
cron
jobs
Zeus load balancer
Crash Reporter
Breakpad
500M Firefox users
•!❑! Over 120 boxes, all physical.
!❑! Why physical?
! •!❑! Organizational momentum
! •!❑! HBase doesn't do so well virtualized. It's very talky between nodes, so low latency is important.
!–!
! •!❑! How much data?
! •!❑! "The smallest big-data project"
! •!❑! Used to be considered big. Not anymore.
! !✓! Numbers
! •!✓! ***500M FF users***
! •!✓! ***150M ADUs. Probably more.***
! •!✓! ***3000 crashes/minute.*** 3M/day.
! •!✓! ***A FF crash*** is 150K-20MB (hard ceiling—anything over 20MB is just an out-of-mem crash
anyway and just full of corrupt garbage)
! •!✓! ***800GB*** in PG
! •!✓! ***110TB*** in HDFS. That's replicated. 40TB actual data.
! !✓! Dictum: “Never lose a crash.” We have all Firefox crashes from the very beginning.
! •!✓! One reason for this is so a developer can go into the UI and request a crash be processed, and it
will be.
16. Duplicate
Finder
Zeus Zeus
Collectors
Local FS
Crash Movers
HBase
RabbitMQ Processors
PostgreSQL
elasticsearch
Web Front-end
memcached
Debug
symbols on
NFS
pgbouncer
LDAP
Middleware
Zeus Zeus
Bugzilla
Associator
Automatic
Emailer
Bugzilla
Materialized
View
Builders
Active Daily Users
Signatures
Versions
Explosiveness
ADU Count
Loader
Version
Scraper
FTP Vertica
Zeus
cron
jobs
Zeus load balancer
Crash Reporter
Breakpad
500M Firefox users
150M daily users
•!❑! Over 120 boxes, all physical.
!❑! Why physical?
! •!❑! Organizational momentum
! •!❑! HBase doesn't do so well virtualized. It's very talky between nodes, so low latency is important.
!–!
! •!❑! How much data?
! •!❑! "The smallest big-data project"
! •!❑! Used to be considered big. Not anymore.
! !✓! Numbers
! •!✓! ***500M FF users***
! •!✓! ***150M ADUs. Probably more.***
! •!✓! ***3000 crashes/minute.*** 3M/day.
! •!✓! ***A FF crash*** is 150K-20MB (hard ceiling—anything over 20MB is just an out-of-mem crash
anyway and just full of corrupt garbage)
! •!✓! ***800GB*** in PG
! •!✓! ***110TB*** in HDFS. That's replicated. 40TB actual data.
! !✓! Dictum: “Never lose a crash.” We have all Firefox crashes from the very beginning.
! •!✓! One reason for this is so a developer can go into the UI and request a crash be processed, and it
will be.
17. Duplicate
Finder
Zeus Zeus
Collectors
Local FS
Crash Movers
HBase
RabbitMQ Processors
PostgreSQL
elasticsearch
Web Front-end
memcached
Debug
symbols on
NFS
pgbouncer
LDAP
Middleware
Zeus Zeus
Bugzilla
Associator
Automatic
Emailer
Bugzilla
Materialized
View
Builders
Active Daily Users
Signatures
Versions
Explosiveness
ADU Count
Loader
Version
Scraper
FTP Vertica
Zeus
cron
jobs
Zeus load balancer
Crash Reporter
Breakpad
500M Firefox users
150M daily users
3000 crashes per minute
•!❑! Over 120 boxes, all physical.
!❑! Why physical?
! •!❑! Organizational momentum
! •!❑! HBase doesn't do so well virtualized. It's very talky between nodes, so low latency is important.
!–!
! •!❑! How much data?
! •!❑! "The smallest big-data project"
! •!❑! Used to be considered big. Not anymore.
! !✓! Numbers
! •!✓! ***500M FF users***
! •!✓! ***150M ADUs. Probably more.***
! •!✓! ***3000 crashes/minute.*** 3M/day.
! •!✓! ***A FF crash*** is 150K-20MB (hard ceiling—anything over 20MB is just an out-of-mem crash
anyway and just full of corrupt garbage)
! •!✓! ***800GB*** in PG
! •!✓! ***110TB*** in HDFS. That's replicated. 40TB actual data.
! !✓! Dictum: “Never lose a crash.” We have all Firefox crashes from the very beginning.
! •!✓! One reason for this is so a developer can go into the UI and request a crash be processed, and it
will be.
21. [The same architecture diagram; the focus moves to the client side at the bottom.]
It all starts ***down here***, with FF.
But even that’s made up of multiple moving parts.
23. [Zoomed diagram: Breakpad and the Crash Reporter on the client, the Zeus load balancer, and the Collectors.]
These ***first 3*** pieces are all on the client side; the ***first 2*** run in the Firefox process.
Breakpad
• Used by Firefox, Chrome, Google Earth, Camino, Picasa
• Takes a stack dump of all threads
• Opaque: it doesn't even know the frame boundaries
• Grabs a little other processor state
• Throws it to another process: the ***Crash Reporter***
Why a separate process? Remember, Firefox has crashed; its state is unknown.
The Crash Reporter is responsible for ***this little dialog***. It sends the binary crash dump plus JSON metadata in a POST to the collectors…
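That hand-off (binary dump plus metadata, POSTed as multipart form data) can be sketched like this. `upload_file_minidump` is Breakpad's conventional field name for the dump; the metadata keys here are only illustrative:

```python
import uuid

def build_crash_submission(dump_bytes, metadata):
    """Assemble a multipart/form-data body like the one the Crash Reporter
    POSTs to the collectors: metadata fields plus the raw minidump."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in metadata.items():
        parts.append(
            ("--%s\r\n"
             'Content-Disposition: form-data; name="%s"\r\n\r\n'
             "%s\r\n" % (boundary, name, value)).encode()
        )
    dump_header = (
        "--%s\r\n"
        'Content-Disposition: form-data; name="upload_file_minidump"; '
        'filename="crash.dmp"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n" % boundary
    ).encode()
    parts.append(dump_header + dump_bytes + b"\r\n")
    parts.append(("--%s--\r\n" % boundary).encode())
    return "multipart/form-data; boundary=" + boundary, b"".join(parts)

content_type, body = build_crash_submission(
    b"MDMP...", {"ProductName": "Firefox", "Version": "29.0"}
)
```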
29. [Diagram, zoomed on the Collectors and the local FS.]
Collectors: super simple. They write crashes to ***local disk…***
30. [Same zoomed diagram.]
Then, another process on the same box…
31. [Diagram, highlighting the Crash Movers.]
The Crash Movers pick crashes up off the local disk and send them to two places.
32. [Diagram, highlighting HBase.]
1st: → HBase, the primary store for crashes. 70 nodes.
At the same time***…***
33. [Diagram, highlighting RabbitMQ.]
Crash IDs → Rabbit.
• Soft realtime: priority and normal queues
• Priority means: process within 60 seconds
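The mover's two writes can be shown as a toy sketch, with a dict standing in for HBase and `queue.Queue` standing in for RabbitMQ's priority and normal queues:

```python
import queue

class CrashMover:
    """Sketch of the Crash Mover: the crash itself goes to the primary store,
    and its ID goes onto a queue for the processors. Stand-ins here: a dict
    for HBase, queue.Queue for RabbitMQ."""

    def __init__(self):
        self.primary_store = {}          # HBase stand-in
        self.normal_q = queue.Queue()    # ordinary processing
        self.priority_q = queue.Queue()  # "process within 60 seconds"

    def move(self, crash_id, raw_crash, priority=False):
        self.primary_store[crash_id] = raw_crash   # 1st: into the store
        target = self.priority_q if priority else self.normal_q
        target.put(crash_id)                       # at the same time: ID -> Rabbit

mover = CrashMover()
mover.move("ab12", b"MDMP...")
mover.move("cd34", b"MDMP...", priority=True)
```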
34. [Diagram, highlighting the Processors.]
Processors
• Where the real action happens.
• To process a crash means to do whatever is necessary to make it visible in the web UI:
• take an ID from Rabbit
• resolve binary addresses to debug symbols
• generate a signature
• then bucket the crash and add it to PG and ES.
First, PG.
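The steps above can be sketched as a minimal pipeline function; the callables are stand-ins for the real stages, which are far more involved:

```python
def process_crash(crash_id, fetch_raw, symbolicate, make_signature, stores):
    """One processor pass: do whatever is needed to make a crash visible in
    the web UI, then write the result to every reporting store."""
    raw = fetch_raw(crash_id)            # the ID came from Rabbit; the dump from HBase
    frames = symbolicate(raw)            # binary addresses -> debug symbols
    signature = make_signature(frames)   # the bucket this crash falls into
    processed = {"crash_id": crash_id, "signature": signature, "frames": frames}
    for store in stores:                 # e.g. PostgreSQL, then elasticsearch
        store[crash_id] = processed
    return processed

pg, es = {}, {}
result = process_crash(
    "ab12",
    fetch_raw=lambda cid: b"MDMP...",
    symbolicate=lambda raw: ["mozilla::Foo()", "js::GC()"],
    make_signature=lambda frames: frames[0],
    stores=[pg, es],
)
```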
35. [Diagram, highlighting PostgreSQL.]
Postgres
• Our main interactive datastore: it's what the web app and most batch jobs talk to.
• Stores:
• unique crash signatures
• numbers of crashes, bucketed by signature
• other aggregations of crash counts on various facets, to make reporting fast
• It's in there for a couple of reasons:
• prompt, reliable answers to queries
• referential integrity: it stores unique crash signatures and their relationships to versions, tickets, and so on
• PHP and Django are easy to query from it
Now, let's turn around and talk about ES, which operates in parallel.
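The "bucketed by signature" aggregations look roughly like this; sqlite stands in for Postgres so the sketch runs anywhere, and the table and column names are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE crashes (signature TEXT, product_version TEXT)")
con.executemany("INSERT INTO crashes VALUES (?, ?)", [
    ("OOM | small", "29.0"),
    ("OOM | small", "29.0"),
    ("js::GC()",    "30.0a1"),
])

# The kind of pre-aggregation that makes reporting fast: crash counts per
# unique signature, faceted by version.
rows = con.execute(
    "SELECT signature, product_version, COUNT(*) AS n "
    "FROM crashes GROUP BY signature, product_version ORDER BY n DESC"
).fetchall()
```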
36. [Diagram, highlighting elasticsearch.]
Elasticsearch
• A 90-day rolling window of crashes
• Faceting
• The new kid on the block
• Extremely flexible text analysis: though it's geared toward natural language, we may be able to persuade it to take apart C++ call signatures and let us mine those in meaningful ways.
• May someday eat some of HBase's or Postgres's lunch:
• it scales out like HBase and can even execute arbitrary scripts near the data, collating and returning results through a master node
• maybe not the flexibility of full map-reduce, but it has filter caching and maintains indices itself
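Faceting over the rolling window amounts to a request like the one sketched below (just the request body as a dict; the field names are illustrative, not Socorro's actual mapping):

```python
from datetime import date, timedelta

def signature_facet_query(today, days=90, size=10):
    """Top signatures over the rolling window: filter by date, then a
    terms aggregation on the signature field."""
    since = today - timedelta(days=days)
    return {
        "query": {"range": {"date_processed": {"gte": since.isoformat()}}},
        "aggs": {"signatures": {"terms": {"field": "signature", "size": size}}},
        "size": 0,   # we want the buckets, not the matching documents
    }

q = signature_facet_query(date(2014, 6, 1))
```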
37. [Diagram, highlighting the middleware.]
Web services (“middleware”)
• At the end of this story is the web application.
• But between it and the data sits a REST middleware layer.
• Why?
• The front-end was in PHP, and we didn't want to reimplement model logic in two languages.
• We change datastores.
• We move data around.
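The point of the layer fits in a few lines: callers ask the middleware, not a datastore, so stores can be swapped and data moved without touching any front-end. A toy sketch, not Socorro's actual API:

```python
class CrashAPI:
    """Middleware sketch: one model layer shared by every front-end language,
    hiding which store currently answers a given query."""

    def __init__(self, stores):
        self.stores = stores   # ordered by preference, e.g. memcached, PG, HBase

    def get_crash(self, crash_id):
        for store in self.stores:
            if crash_id in store:
                return store[crash_id]
        return None

cache, pg = {}, {"ab12": {"signature": "js::GC()"}}
api = CrashAPI([cache, pg])
```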
38. [Diagram, highlighting the web front-end.]
Web App
• Django
• Each box runs memcached
39. [The full architecture diagram once more.]
And that concludes our big-picture tour of Socorro!
Now, as the years have gone by and the system has grown in scope and size, some interesting patterns have emerged.
40. Big Patterns
Tooling was clearly missing; standard practices weren't good enough.
I'm going to call out some of these emergent needs and show you our solutions.
Maybe you'll even find some of our tools useful.
The first…
41. Big Storage
Every Big Data system has to put everything somewhere. The solutions (sharding, replication) are well-established, and the amount of data you can deal with in a commoditized fashion rises every year, but it's expensive.
We realized that, by an application of statistics, we could ***shrink the amount of data***.
42. Big Storage
***Sampling***: a per-product rate; we keep all Firefox OS crashes. But we don't want to lose interesting rare events to sampling, so we also do…
***Targeting***: take anything with a comment, for example.
• Our statisticians have told us all kinds of useful things about the shape of our data. For instance, the rules that select interesting events don't throw off our OS or version statistics.
***Rarification***: throw away uninteresting parts of stack frames.
• Skiplist rules get uninteresting parts of the stack out of the data, to reduce noise. Two kinds:
• sentinel frames to jump TO
• frames that should be ignored
• An important part of making our hash buckets wider, reducing the number of unique crash signatures.
With these 3 techniques, we cut down the amount of data we need to handle in the later stages of our pipeline.
Sure, we still have to keep everything in HBase, but we don't run live queries against that, so it just means buying more HDs.
But the processors, Rabbit, PG, ES, memcache, and crons all carry a lighter load.
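The two kinds of skiplist rule can be sketched in a few lines. The patterns below are invented for illustration; the real lists are long and curated:

```python
import re

SENTINELS = {"_purecall"}   # frames to jump TO: start the signature here
IRRELEVANT = re.compile(r"^(KiFastSystemCall|RtlUserThreadStart)")  # frames to ignore

def make_signature(frames, max_frames=5):
    """Widen the hash buckets: jump to a sentinel frame if one appears,
    drop irrelevant frames, and keep only the top of what's left."""
    for i, frame in enumerate(frames):
        if frame in SENTINELS:
            frames = frames[i:]          # sentinel rule: jump TO this frame
            break
    kept = [f for f in frames if not IRRELEVANT.match(f)]   # skip rule
    return " | ".join(kept[:max_frames])
```

Fewer distinct signatures means fewer rows and buckets everywhere downstream of the processors.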
46. Big Systems
• Big Data systems tend to be complicated systems with diverse parts: it's not just one big 500-node HBase cluster and you're done.
• Example: we have 6 data stores: the local FS, PG, ES, HBase, memcache, and RabbitMQ.
• This is typical of architectures now. Gone are the days of one datastore, one representation.
• 18 months ago, I was hearing jokes about the "data mullet": relational in the front, NoSQL in the back. Now it's data dreadlocks: it's all over the place.
The kinds of problems you can have in these systems are really tough to track down.
47. Hadoops! A tale of Big Failure
(The takeaways: complex interactions; hardware matters; design for failure.)
A crash every 50 hours: ***Hadoop's cleverness*** with TCP connections, plus TCP stack bugs in Linux, plus lying NICs. The OS buffers fill up with unclosed connections, and the node crashes.
• So we're very, very cautious about ***the equipment*** we use. Remember that hardware is a nontrivial part of your system.
• When you have a problem like this, it can be hard to work out exactly what's gone wrong, and it can take time to get everybody together. Meanwhile, we must keep receiving crashes.
***Boxes & springs***
51. [The full architecture diagram.]
The most important piece: ***this local FS***
53. [Diagram with the web tier stripped away, centered on the local FS.]
Everything else can fail: the local FS gives us 3 days of runway, and it has saved us several times.
Yours may not look like this, but:
• You could imagine a system being able to serve just out of cache if the datastore went away.
• Or operate in read-only mode if writes became unavailable, as SUMO does.
One thing from this diagram we haven't talked about much yet is the ***cron jobs***.
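The runway property rests on the collector's write path being trivial and local. A sketch of the idea (write-then-rename so a dying collector never leaves a half-written spool file; the paths are invented):

```python
import os
import tempfile

def spool_crash(spool_dir, crash_id, dump_bytes):
    """Land the crash on local disk and nothing else; draining it downstream
    can lag for days without losing anything."""
    final_path = os.path.join(spool_dir, crash_id + ".dump")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(dump_bytes)
    os.rename(tmp_path, final_path)   # atomic on POSIX: no partial files visible
    return final_path

demo_dir = tempfile.mkdtemp()
demo_path = spool_crash(demo_dir, "ab12", b"MDMP...")
```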
54. Big Batching
• Mozilla is a large project with a long legacy, and Socorro interfaces with a lot of other systems. ***A lot of this occurs via batch jobs.***
55. [The full architecture diagram, highlighting the cron jobs.]
57. In fact, you can look at a lot of our periodic tasks as a dependency tree.
One thing upstream fails***…***
58. …and everything downstream of it fails too.
We replaced cron with crontabber. Instead of blindly running jobs whose prerequisites aren't fulfilled, it runs the ***parent*** until it succeeds, then runs the ***children***.
We wanted diagrams to visualize the state of the system, but drawing them by hand was too error-prone. ***Then*** we thought: why not have crontabber draw them for us?
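Crontabber's core idea fits in a few lines. A toy sketch with invented job names; the real tool also persists state, retries, and backs off:

```python
def run_jobs(jobs, depends_on, run):
    """Run jobs listed parents-first; a job whose dependencies haven't all
    succeeded is blocked instead of being run blindly against missing data."""
    succeeded, failed, blocked = set(), set(), set()
    for job in jobs:
        deps = depends_on.get(job, ())
        if any(d not in succeeded for d in deps):
            blocked.add(job)           # an upstream failure stops the subtree
            continue
        (succeeded if run(job) else failed).add(job)
    return succeeded, failed, blocked

# adu_load feeds the matview build, which feeds the report.
ok, bad, held = run_jobs(
    ["adu_load", "matview", "report"],
    {"matview": ["adu_load"], "report": ["matview"]},
    run=lambda job: job != "matview",   # simulate the matview job failing
)
```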
61.-63. [Screenshots of the diagrams crontabber generates.]
64. SVGs are really neat: things can wiggle when their state is unclear.
And then we break the specifics down into a ***table…***
65. It runs one job at a time at the moment, because “eek, matview performance”, but a great contribution would be some kind of shared locks or thresholds to allow running several at once.
But you know, right now, it's ***good enough…***
66. Big Deal
And it's surprising how often that happens: oftentimes, your makeshift solutions end up being good enough to do the job.
67. [The full architecture diagram, highlighting the improvised pieces.]
***A slapdash, hacky queue (PG)***: one job polls HBase and writes to PG; another polls PG and feeds the processors.
The ***local FS buffer*** was a temporary fix from when we had reliability problems with HBase.
***I could tell*** you “don't be afraid of temporary hacks”, but I think that's a healthy fear to have.
Or perhaps my message should be: do a good job on your temporary solutions, because they'll probably be around awhile.
71. One definition of “big”: more than you can hook up to one computer, or fit on one desk. That definition changes every year.
The fact that I'm wearing nearly 100GB would be unimaginable to the operator of a punch-card duplicator from only 50 years ago.
But the patterns that come out of large systems remain. Duplicate cards: why did they exist? To facet two ways in parallel.
While you may need to generalize a bit, I have no doubt that the techniques you learn today and tomorrow will serve you well into the future.