Flávio Clésio
September, 2020
Machine Learning Operations
Active Failures, Latent Conditions
Open Data Science Community
ABOUT ME
Flávio Clésio
• MSc. in Production Engineering (ML in Credit Derivatives/NPL)
• Machine Learning Engineer by education, Data Scientist by profession
• Blogger @ flavioclesio.com
• Spoke at several venues (Strata Hadoop World, Spark Summit, PAPIS.io, The
Developers Conference, and so on…)
flavioclesio
The views expressed in this presentation are my own. They have not been reviewed or
approved by my current employer (My Hammer AG) or by past employers. I do not
speak for my current employer or any other company, and this talk has no
conflicts of interest.
DISCLAIMER
CURRENT STATE OF ML SYSTEMS
● Machine Learning systems play a huge role in many businesses, from banking and
recommender systems to health domains.
● When we talk about high-stakes Machine Learning in production, we can consider
that the era of "a-data-scientist-with-a-script-on-a-single-machine" is officially
over.
This talk discusses risk assessment in ML systems from the perspective of
reliability, operations, and especially the causal factors that can lead to outages in ML
systems.
WHAT THIS TALK IS ABOUT
SURVIVORSHIP BIAS
Most blog posts, conference talks, and papers present only what
worked extremely well, how those solutions generated revenue for the company, and
other happy cases.
SURVIVORSHIP BIAS
● Almost no one discloses what went wrong during the development of these
solutions.
● This is a problem because we only see the final outcome, not how that outcome
was produced or the failures and errors made along the way.
● It is not sexy
● People can feel blamed or silly when talking about their errors
● It can turn into "bad personal/corporate branding"
● Those who "died" during the process cannot tell what went wrong
FAILURE: A NOT SO ROMANTIC TOPIC
WHAT IT LOOKS LIKE
● Amazon: Load-balancer data was
deleted, causing a disruption across
practically an entire AWS region at the
time;
● Gitlab: The deletion of a production
database led to roughly 18 hours of
unavailability and the loss of customer data;
● Knight Capital: The lack of a code-review
culture allowed an engineer to deploy
code that was 8 years out of date to production.
Outcome: losses of about $172,222 per second
for 45 minutes (roughly US$ 465 million).
SOME SPECIFIC FAILURE CASES
● European Space Agency: A conversion
of a 64-bit floating-point value to a
16-bit integer caused an overflow in the
rocket steering system that triggered a
chain of events that destroyed the rocket,
a loss of more than $370 million.
● NASA: The erosion of an engineering
culture into a political/product culture led to
a catastrophic failure that not only cost
billions of dollars but also killed the crew
of the Challenger space shuttle.
FAILURE = LEARNING = OPPORTUNITY TO IMPROVE
● There is always a lesson to be learned from what went wrong
● A good culture is not about assigning blame or ignoring problems, but about
analyzing them, learning, and improving
● With every lesson learned, the whole system becomes more reliable
RELIABILITY BENCHMARK
● The aviation industry is one example where reliability is significantly
increased after every incident or accident.
● It is one of the few industries that has become more reliable even as traffic has
increased over time; as a result, the number of fatalities has been
falling year by year.
This model was created by James Reason in the early 1990s as a general framework for
understanding the dynamics of accident causation.
The idea is to identify latent conditions and active failures and put in place
countermeasures to minimize unwanted variability in human behaviour in
socio-technical systems.
Sources: “Human error: models and management” and "The contribution of latent human failures to the breakdown of complex systems"
SWISS CHEESE MODEL
DEFENSES, BARRIERS, SAFEGUARDS
[...] High-tech systems have many defensive layers: some are designed (alarms,
physical barriers, automatic shutdowns, etc.), others rely on people (surgeons,
anesthetists, pilots, control room operators, etc.) and others rely on procedures and
administrative controls. [...]
Source: Human error: models and management
[...] In an ideal world, each defensive layer would be intact. In reality, however, they are
more like slices of Swiss cheese, having many holes - though, unlike in the cheese, these
holes are continually opening, shutting and shifting their location. The presence of holes in
any one "slice" does not normally cause a bad outcome. Usually, this can happen only when
the holes in many layers momentarily line up to permit a trajectory of accident opportunity -
bringing hazards into damaging contact with victims [...]
Source: Human error: models and management
DEFENSES, BARRIERS, SAFEGUARDS
SWISS CHEESE MODEL
Source: Understanding models of error and how they apply in clinical practice
● Local fixes are appealing because they sound productive and look good to
other teams (e.g. "see how fast this person solved the problem?")
● They cultivate dormant problems and silent risks that can potentially cause
harm, instead of eliminating them
● They promote a stagnant engineering culture instead of aiming for continuous reform
(i.e. substantial change, not just cosmetic enhancements)
● Relying on local fixes is like having a mosquito problem and swatting the mosquitoes
every day, instead of solving the problem by draining the swamps in which they
breed (James Reason)
WHY NOT JUST FIX THE PROBLEM AND MOVE ON?
In this case, each slice of Swiss cheese is a line of defense made up of
engineered layers (e.g. monitoring, alarms, locks on pushing code to production) and/or
procedural layers that involve people (e.g. cultural aspects, training and
qualification, rollback mechanisms, unit and integration tests).
LATENT CONDITIONS AND ACTIVE FAILURES
LATENT CONDITIONS
[...] Latent conditions are situations intrinsically resident within the
system; they are consequences of design and engineering decisions, of those who wrote the
rules or procedures, and even of the highest hierarchical levels of an organization. These
latent conditions can lead to two kinds of adverse effects: error-provoking situations
and long-lived vulnerabilities. In other words, the solution has a design that increases
the likelihood of high-negative-impact events and can act as a causal or
contributing factor. [...]
Source: Human error: models and management
ACTIVE FAILURES
[...] Active failures are unsafe acts or minor transgressions committed by people who
are in direct contact with the system; these acts can be slips, lapses, mistakes,
omissions and procedural violations. [...]
Source: Human error: models and management
HUMAN FACTORS
Source: Human Factors Analysis and Classification System (HFACS)
● Absence of a code-review culture (e.g. London Whale and Knight Capital)
● A culture of improvised technical arrangements (e.g. workarounds)
● Lack of observability
● Decisions made by majority vote among less informed people, rather than by
consensus among experts and risk-takers
SOME LATENT CONDITIONS IN ML
● Resumé-Driven Development
● Unreviewed code going into production
● Data Leakage in model training (see the sketch after this slide)
● Lack of reproducibility / replicability
● Glue code
SOME ACTIVE FAILURES IN ML
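To make the data-leakage item above concrete, here is a minimal sketch in Python (scikit-learn assumed; the toy dataset and model are illustrative, not part of the original slides) contrasting a leaky preprocessing setup with a safer, pipeline-based one:

```python
# Minimal sketch (scikit-learn assumed; not from the original slides) contrasting
# a leaky preprocessing setup with a safer, pipeline-based one.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)  # fixing seeds also helps reproducibility
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Leaky: the scaler is fit on the full dataset, so test-set statistics
# "leak" into the features the model is trained on.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=42)

# Safer: split first, then let the pipeline fit the scaler on training data only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)
print("held-out accuracy:", model.score(X_te, y_te))
```

Keeping all preprocessing inside the pipeline also makes the training run easier to reproduce, which touches the reproducibility item as well.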
A SWISS CHEESE OF AN OUTAGE IN ML SYSTEM
● A commitment to principles of operational excellence
● Post-Mortems
● Automation
● Monitoring & Alerts (see the sketch after this slide)
○ Observability
■ Metrics
■ Logs
■ Application Performance Monitoring
● Continuous reform and permanent assessment
SOME STRATEGIES TO MINIMIZE RISK SURFACE
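As a concrete illustration of the monitoring and alerts item above, here is a minimal sketch assuming the prometheus_client Python library; the metric names and the predict_fn placeholder are illustrative, not taken from the slides:

```python
# Minimal observability sketch assuming the prometheus_client library; the metric
# names and the predict_fn placeholder are illustrative, not from the slides.
import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("model_predictions_total", "Predictions served")
ERRORS = Counter("model_prediction_errors_total", "Failed predictions")
LATENCY = Histogram("model_prediction_latency_seconds", "Prediction latency")

def predict_with_metrics(predict_fn, features):
    """Wrap any prediction callable with basic metrics to alert on."""
    start = time.perf_counter()
    try:
        result = predict_fn(features)
        PREDICTIONS.inc()
        return result
    except Exception:
        ERRORS.inc()  # alert on error-rate spikes
        raise
    finally:
        LATENCY.observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        predict_with_metrics(lambda x: sum(x), [1.0, 2.0, 3.0])
        time.sleep(1)
```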
● More testing (unit, integration, regression, etc.; see the sketch after this slide)
● Longer lead times to deliver those systems
● More time spent on design
● Increased complexity
● Costs related to redundancy (e.g. backups, replication, load balancing, failover,
elastic setups, etc.)
● Cultural changes (especially in engineering principles)
MAKING ML SYSTEMS MORE FAULT-TOLERANT HAS SOME PENALTIES
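To illustrate what "more testing" can look like for an ML system, here is a minimal pytest sketch of contract-style tests on a model's input/output behaviour; the train_model helper and the specific checks are illustrative assumptions, not the deck's own test suite:

```python
# Minimal pytest sketch of contract-style tests for a model; the train_model
# helper and the specific checks are illustrative assumptions.
import numpy as np
import pytest
from sklearn.linear_model import LogisticRegression

def train_model(X, y):
    return LogisticRegression().fit(X, y)

@pytest.fixture
def model():
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = (X[:, 0] > 0).astype(int)
    return train_model(X, y)

def test_output_shape_and_range(model):
    # Probabilities must have one column per class and live in [0, 1].
    proba = model.predict_proba(np.zeros((5, 3)))
    assert proba.shape == (5, 2)
    assert np.all((proba >= 0) & (proba <= 1))

def test_rejects_wrong_feature_count(model):
    # A feature-count mismatch must fail loudly rather than predict garbage.
    with pytest.raises(ValueError):
        model.predict(np.zeros((1, 7)))
```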
● Orchestration (e.g. Mesos, Airflow, Kubernetes, AWS ECS, Kubeflow)
● Observability (e.g. Elasticsearch, Kibana, Prometheus, Sentry, Grafana, Fluent Bit,
Datadog)
● ML experiment management (e.g. ModelChimp, Randopt, Forge, Lore, Datmo,
Studio ML, Sacred, MLflow, Polyaxon; see the sketch after this slide)
● Data versioning and management (e.g. DVC, Pachyderm, Snorkel)
● ML SaaS (e.g. Algorithmia, Peltarion, Databricks, Seldon, Google AI Platform,
AWS SageMaker, Azure ML Studio, Dotscience, Dataiku DSS, Domino Data Lab,
Polyaxon, Weights & Biases, Spell, Gradient, Paperspace, H2O.ai, Stack ML,
Comet, Valohai, Neptune AI)
SOME TOOLS
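As a small illustration of the experiment-management tools listed above, here is a minimal sketch assuming MLflow; the experiment name, hyperparameter, and metric are illustrative, not from the slides:

```python
# Minimal experiment-tracking sketch assuming MLflow; the experiment name,
# hyperparameter and metric below are illustrative, not from the slides.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

mlflow.set_experiment("risk-demo")  # hypothetical experiment name

X, y = make_classification(n_samples=500, random_state=0)

with mlflow.start_run():
    C = 0.5
    model = LogisticRegression(C=C, max_iter=1000)
    mlflow.log_param("C", C)
    mlflow.log_metric("cv_accuracy", cross_val_score(model, X, y, cv=5).mean())
    mlflow.sklearn.log_model(model.fit(X, y), "model")  # versioned artifact
```

Logging parameters, metrics, and the fitted artifact for every run is one concrete way to address the reproducibility and rollback concerns raised earlier.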
● There is no silver bullet for risk management in ML platforms. The hard
part is knowing how to perceive and manage those risks
● An outage never happens for a single reason. Outages occur when several latent
conditions and active failures combine and align to trigger the event
● Human factors play a huge role in outages
● Fault-free systems are not for free
● If possible, share your mistakes. When someone shares what went wrong,
everyone learns and all ML systems become more robust.
FINAL REMARKS
THANK YOU!
@flavioclesio · fclesio · flavioclesio · flavioclesio.com
Reason, James. “The contribution of latent human failures to the breakdown of complex systems.”
Philosophical Transactions of the Royal Society of London. B, Biological Sciences 327.1241 (1990):
475-484.
Reason, J. “Human error: models and management.” BMJ (Clinical research ed.) vol. 320,7237 (2000):
768-70. doi:10.1136/bmj.320.7237.768
Morgenthaler, J. David, et al. “Searching for build debt: Experiences managing technical debt at
Google.” 2012 Third International Workshop on Managing Technical Debt (MTD). IEEE, 2012.
Alahdab, Mohannad, and Gül Çalıklı. “Empirical Analysis of Hidden Technical Debt Patterns in Machine
Learning Software.” International Conference on Product-Focused Software Process Improvement.
Springer, Cham, 2019.
REFERENCES
Perneger, Thomas V. “The Swiss cheese model of safety incidents: are there holes in the metaphor?.”
BMC health services research vol. 5 71. 9 Nov. 2005, doi:10.1186/1472-6963-5-71
“Hot cheese: a processed Swiss cheese model.” JR Coll Physicians Edinb 44 (2014): 116-21.
Breck, Eric, et al. “What’s your ML Test Score? A rubric for ML production systems.” (2016).
SEC Charges Knight Capital With Violations of Market Access Rule
Machine Learning Goes Production! Engineering, Maintenance Cost, Technical Debt, Applied Data
Analysis Lab Seminar
REFERENCES
REFERENCES
Nassim Taleb – Lectures on Fat Tails, (Anti)Fragility, Precaution, and Asymmetric Exposures
Skybrary – Human Factors Analysis and Classification System (HFACS)
CEFA Aviation – Swiss Cheese Model
A List of Post-mortems
Richard Cook – How Complex Systems Fail
Airbus – Hull Losses
Number of flights performed by the global airline industry from 2004 to 2020
