Flávio Clésio
April, 2020
Machine Learning Operations
Active Failures, Latent Conditions
MLOps Community
ABOUT ME
Flávio Clésio
• Machine Learning Engineer @ MyHammer AG
• MSc. in Production Engineering (Machine Learning in Credit
Derivatives/NPL)
• Specialist in Database Engineering and Business Intelligence
• Blogger @ flavioclesio.com
• Speaker at several venues (Strata Hadoop World, Spark Summit, PAPIS.io, The
Developers Conference and so on…)
flavioclesio
CURRENT STATE OF ML SYSTEMS
Machine Learning systems play a huge role in several businesses, from the banking
industry and recommender systems to health domains.
When we talk about high-stakes Machine Learning in production, we can consider that
the era of "a-data-scientist-with-a-script-in-a-single-machine" is officially over.
This talk will discuss risk assessment in ML Systems from the perspective of
reliability, operations and especially causal aspects that can lead to outages in ML
Systems.
WHAT IS IT ABOUT?
SURVIVORSHIP BIAS
Most posts, conference talks and papers present only what worked extremely well,
how those solutions generated revenue for the company, and other happy cases.
SURVIVORSHIP BIAS
Almost no one discloses what went wrong during the development of these
solutions.
This is a problem because we only see the final outcome, not how that outcome
was generated or the failures and errors made along the way.
● It is not sexy
● People can feel blamed or silly when talking about their errors
● It can turn into "bad personal/corporate branding"
● Those who died during the process cannot tell what went wrong
FAILURE: A NOT SO ROMANTIC TOPIC
WHAT IT LOOKS LIKE
● Amazon: Data on a load balancer was
deleted, which caused a disruption across
practically an entire AWS region at the
time;
● GitLab: The deletion of a production
database led to 18 hours of unavailability
and loss of customer data;
● Knight Capital: Lack of a code review
culture allowed an engineer to deploy
8-year-old, outdated code to production.
Outcome: losses of $172,222 per second
for 45 minutes (about US$ 465 million).
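The Knight Capital figures can be cross-checked with simple arithmetic. The sketch below uses only the two numbers quoted on the slide; the variable names are illustrative.

```python
# Cross-check of the Knight Capital figures quoted on the slide.
LOSS_PER_SECOND_USD = 172_222   # loss rate quoted on the slide
OUTAGE_MINUTES = 45             # duration quoted on the slide

outage_seconds = OUTAGE_MINUTES * 60
total_loss_usd = LOSS_PER_SECOND_USD * outage_seconds

print(f"{outage_seconds} seconds of faulty trading")
print(f"total loss: ${total_loss_usd:,} (about ${total_loss_usd / 1e6:.0f} million)")
```

The product rounds to the US$ 465 million total given on the slide.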
SOME SPECIFIC FAILURE CASES
● European Space Agency: A conversion
from a 64-bit floating-point number to a
16-bit integer caused an overflow in the
rocket steering system, triggering a chain
of events that destroyed the rocket and
caused a loss of more than $370 million;
● NASA: A degradation from an engineering
culture to a political/product culture led to
a catastrophic failure that not only cost
billions of dollars but also killed the crew
of the Challenger space shuttle.
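A narrowing conversion, where a wide numeric value is stored in a 16-bit signed integer, can overflow silently when no range check is in place. The sketch below only illustrates that general mechanism in Python. It is not the actual flight software, and the function names and the sensor value are hypothetical.

```python
import ctypes

INT16_MIN, INT16_MAX = -2**15, 2**15 - 1  # range of a 16-bit signed integer


def unchecked_to_int16(value: float) -> int:
    """Narrow a float to 16 bits with no range check (wraps around silently)."""
    return ctypes.c_int16(int(value)).value


def checked_to_int16(value: float) -> int:
    """Narrow a float to 16 bits, raising instead of wrapping when out of range."""
    as_int = int(value)
    if not INT16_MIN <= as_int <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return as_int


reading = 40_000.0  # hypothetical sensor value, larger than INT16_MAX (32767)
print(unchecked_to_int16(reading))  # wraps around to a negative number
try:
    checked_to_int16(reading)
except OverflowError as exc:
    print(f"rejected: {exc}")
```

The checked version turns a silent wrong value into an explicit failure, which a surrounding defensive layer can then handle.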
FAILURE = LEARNING = OPPORTUNITY TO IMPROVE
● There is always a lesson to be learned from what went wrong
● A good culture is not about blaming people or avoiding the problems, but about
analysing them, learning and improving
● With every lesson learned, the whole system becomes more reliable
The aviation industry is one example where reliability increases significantly
with every incident/accident.
It is one of the industries that becomes more reliable even as the number of
transactions (flights) grows over time; because of that, the number of fatalities has
been falling year by year.
RELIABILITY BENCHMARK
The Swiss Cheese Model was created by James Reason in the early 1990s as a general
framework for understanding the dynamics of accident causation.
The idea is to identify latent conditions and active failures in order to put in place
countermeasures that minimize unwanted variability in human behaviour in
socio-technological systems.
Sources: “Human error: models and management” and "The contribution of latent
human failures to the breakdown of complex systems"
SWISS CHEESE MODEL
DEFENSES, BARRIERS, SAFEGUARDS
[...] High-tech systems have many defensive layers: some are designed (alarms,
physical barriers, automatic shutdowns, etc.), others rely on people (surgeons,
anesthetists, pilots, control room operators, etc.) and others rely on procedures and
administrative controls. [...]
Source: Human error: models and management
[...] In an ideal world, each defensive layer would be intact. In reality, however, they are
more like slices of Swiss cheese, with many holes - although, unlike cheese, these holes
are continually opening, closing and moving. The presence of holes in any "slice" does
not normally cause a bad result. This can usually happen only when the holes in several
layers line up momentarily to allow for an accident opportunity trajectory - bringing risks
to harmful contact with victims[...]
Source: Human error: models and management
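One way to build intuition for the passage above is to give each defensive layer a probability that a hole is open at a given moment. The sketch below assumes the layers fail independently, a simplification made here that the source does not state, and uses made-up probabilities. In practice latent conditions can affect several layers at once, so real risk may be higher than this product suggests.

```python
import math
import random


def accident_probability(hole_probs):
    """Probability that holes in every layer line up, assuming independence."""
    return math.prod(hole_probs)


def simulate(hole_probs, trials=100_000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    hits = sum(
        all(rng.random() < p for p in hole_probs)
        for _ in range(trials)
    )
    return hits / trials


# Hypothetical layers: code review, tests, monitoring, rollback.
layers = [0.2, 0.1, 0.3, 0.25]
print(accident_probability(layers))      # product of the four probabilities
print(accident_probability(layers[:2]))  # fewer layers, higher probability
print(simulate(layers))                  # should be close to the exact product
```

Removing layers raises the probability of an aligned trajectory, which is why each additional slice matters even when it is imperfect.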
DEFENSES, BARRIERS, SAFEGUARDS
SWISS CHEESE MODEL
Source: Understanding models of error and how they apply in clinical practice
● Local fixes are appealing because they sound productive and look good to
other teams (e.g. "see how fast this person solved the problem?")
● They cultivate dormant problems and silent risks that can potentially cause
harm, instead of eliminating them
● They promote a stagnant engineering culture instead of aiming for continuous reform
(i.e. substantial reform, not only cosmetic enhancements)
● Local fixes are like having a problem with mosquitoes and swatting them
every day, instead of solving the problem by draining the swamps in which they breed.
WHY NOT JUST FIX THE PROBLEM AND MOVE ON?
That is, in an ML system each slice of Swiss cheese is a line of defense, made of
designed layers (e.g. monitoring, alarms, locks on code pushes to production) and/or
procedural layers that involve people (e.g. cultural aspects, training and
qualification of committers in the repository, rollback mechanisms, unit and integration
tests).
LATENT CONDITIONS AND ACTIVE FAILURES
LATENT CONDITIONS
[...] Latent conditions are situations intrinsically resident within the system. They
are consequences of design and engineering decisions, of those who wrote the rules
or procedures, and even of the highest hierarchical levels in an organization. These latent
conditions can lead to two types of adverse effects: situations that provoke
errors and the creation of vulnerabilities. That is, the solution has a design that increases
the likelihood of high-negative-impact events, which can be equivalent to a causal or
contributing factor. [...]
Source: Human error: models and management
ACTIVE FAILURES
[...] Active failures are unsafe acts or minor transgressions committed by people who
are in direct contact with the system; these acts can be mistakes, lapses, distortions,
omissions, errors and procedural violations. [...]
Source: Human error: models and management
HUMAN FACTORS
Source: Human Factors Analysis and Classification System (HFACS)
● Absence of Code Review culture (e.g. London Whale and Knight Capital)
● Culture of improvised technical arrangements (e.g. workarounds)
● Lack of observability
● Decisions made by majority vote among less-informed people rather than by
consensus between experts and risk-takers
SOME LATENT CONDITIONS IN ML
● Résumé-Driven Development
● Unreviewed code going into production
● Data Leakage in model training
● Lack of reproducibility / replicability
● Glue code
SOME ACTIVE FAILURES IN ML
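Two of the items above, data leakage and lack of reproducibility, can be guarded against with small automated checks. The sketch below uses only the standard library. The `split` and `leaked_ids` helpers and the `id` key are hypothetical names chosen for illustration, and records shared between splits are only one of several forms of leakage.

```python
import random


def split(rows, test_fraction=0.25, seed=42):
    """Deterministic train/test split: the same seed gives the same split."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def leaked_ids(train, test, key="id"):
    """IDs that appear in both splits (one simple form of leakage)."""
    return {r[key] for r in train} & {r[key] for r in test}


rows = [{"id": i, "x": i * 0.5} for i in range(20)]
train, test = split(rows)

assert split(rows) == (train, test)      # reproducible across runs
assert leaked_ids(train, test) == set()  # no record is in both splits

# A duplicated record that lands in both splits is flagged.
train_with_dup = train + [test[0]]
print(leaked_ids(train_with_dup, test))
```

Checks like these can run in the training pipeline so that a leak or a non-deterministic split fails loudly instead of reaching production.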
A SWISS CHEESE OF AN OUTAGE IN AN ML SYSTEM
● Post-Mortems
● Automation
● Monitoring & Alerts
○ Observability
■ Metrics
■ Logs
■ Application Performance Monitoring
● Continuous reform and permanent assessment
SOME STRATEGIES TO MINIMIZE RISK SURFACE
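The "Monitoring & Alerts" item above can be illustrated with a small check that logs a metric and raises an alert when a model's predictions drift away from a baseline. This is a sketch under stated assumptions, not a production monitor. The z-score rule, the threshold of 3, the logger name and the sample scores are all hypothetical, and it assumes the baseline has non-zero variance.

```python
import logging
from statistics import mean, stdev

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("model-monitor")  # hypothetical logger name


def drift_alert(baseline, current, max_z=3.0):
    """Return True when the mean of `current` is far from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    standard_error = sigma / len(current) ** 0.5
    z = abs(mean(current) - mu) / standard_error
    log.info("metric=prediction_mean z_score=%.2f", z)  # metric for dashboards
    if z > max_z:
        log.warning("ALERT: prediction mean drifted (z=%.2f)", z)
        return True
    return False


baseline_scores = [0.48, 0.52, 0.50, 0.47, 0.53, 0.51, 0.49, 0.50]
healthy_batch = [0.50, 0.49, 0.52, 0.48]
shifted_batch = [0.80, 0.78, 0.83, 0.79]

print(drift_alert(baseline_scores, healthy_batch))
print(drift_alert(baseline_scores, shifted_batch))
```

In a real system the log lines would feed the metrics and logging tools, and the warning would be routed to an alerting channel.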
● Orchestration (e.g. Mesos, Airflow, Kubernetes, AWS ECS, Kubeflow)
● Observability (e.g. Elasticsearch, Kibana, Prometheus, Sentry, Grafana, FluentBit,
Datadog)
● ML Experiment Management (e.g. ModelChimp, Randopt, Forge, Lore, Datmo,
Studio ML, Sacred, MLflow, Polyaxon)
● Data Versioning and management (e.g. DVC, Pachyderm, Snorkel)
● ML SaaS (e.g. Algorithmia, Peltarion, Databricks, Seldon IO, Google AI Platform,
AWS SageMaker, Azure ML Studio, Dotscience, Dataiku DSS, Domino AI,
Polyaxon, Weights & Biases, Spell, Gradient, Paperspace, H2O AI, Stack ML,
Comet, Valohai, Neptune AI)
SOME TOOLS
• There’s no silver bullet for risk management in ML platforms. The hard
part is knowing how to perceive and manage those risks
• An outage never happens due to a single reason. Outages happen when several latent
conditions and active failures combine and align to trigger the event
• Human factors play a huge role in outages
• If possible, share your mistakes. When someone shares what went wrong,
everyone learns and all ML systems become more robust.
FINAL REMARKS
THANK YOU!
@flavioclesio | fclesio | flavioclesio | flavioclesio.com
● Reason, James. “The contribution of latent human failures to the breakdown of complex
systems.” Philosophical Transactions of the Royal Society of London. B, Biological Sciences
327.1241 (1990): 475-484.
● Reason, J. “Human error: models and management.” BMJ (Clinical research ed.) vol. 320,7237
(2000): 768-70. doi:10.1136/bmj.320.7237.768
● Morgenthaler, J. David, et al. “Searching for build debt: Experiences managing technical debt at
Google.” 2012 Third International Workshop on Managing Technical Debt (MTD). IEEE, 2012.
● Alahdab, Mohannad, and Gül Çalıklı. “Empirical Analysis of Hidden Technical Debt Patterns in
Machine Learning Software.” International Conference on Product-Focused Software Process
Improvement. Springer, Cham, 2019.
REFERENCES
● Perneger, Thomas V. “The Swiss cheese model of safety incidents: are there holes in the
metaphor?.” BMC health services research vol. 5 71. 9 Nov. 2005, doi:10.1186/1472-6963-5-71
● “Hot cheese: a processed Swiss cheese model.” JR Coll Physicians Edinb 44 (2014): 116-21.
● Breck, Eric, et al. “What’s your ML Test Score? A rubric for ML production systems.” (2016).
● SEC Charges Knight Capital With Violations of Market Access Rule
● Machine Learning Goes Production! Engineering, Maintenance Cost, Technical Debt, Applied Data
Analysis Lab Seminar
REFERENCES
● Nassim Taleb – Lectures on Fat Tails, (Anti)Fragility, Precaution, and Asymmetric Exposures
● Skybrary – Human Factors Analysis and Classification System (HFACS)
● CEFA Aviation – Swiss Cheese Model
● A List of Post-mortems
● Richard Cook – How Complex Systems Fail
● Airbus – Hull Losses
● Number of flights performed by the global airline industry from 2004 to 2020

More Related Content

What's hot

Machine Learning at Scale with MLflow and Apache Spark
Machine Learning at Scale with MLflow and Apache SparkMachine Learning at Scale with MLflow and Apache Spark
Machine Learning at Scale with MLflow and Apache Spark
Databricks
 
MLFlow: Platform for Complete Machine Learning Lifecycle
MLFlow: Platform for Complete Machine Learning Lifecycle MLFlow: Platform for Complete Machine Learning Lifecycle
MLFlow: Platform for Complete Machine Learning Lifecycle
Databricks
 
Challenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in ProductionChallenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in Production
iguazio
 
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle ManagementMLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
Databricks
 
MLOps with Kubeflow
MLOps with Kubeflow MLOps with Kubeflow
MLOps with Kubeflow
Saurabh Kaushik
 
Msst 2019 v4
Msst 2019 v4Msst 2019 v4
Msst 2019 v4
Nisha Talagala
 
DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
Stepan Pushkarev
 
Spark Summit EU 2017 - Preventing revenue leakage and monitoring distributed ...
Spark Summit EU 2017 - Preventing revenue leakage and monitoring distributed ...Spark Summit EU 2017 - Preventing revenue leakage and monitoring distributed ...
Spark Summit EU 2017 - Preventing revenue leakage and monitoring distributed ...
Flavio Clesio
 
Real time machine learning
Real time machine learningReal time machine learning
Real time machine learningVinoth Kannan
 
Managing the Machine Learning Lifecycle with MLflow
Managing the Machine Learning Lifecycle with MLflowManaging the Machine Learning Lifecycle with MLflow
Managing the Machine Learning Lifecycle with MLflow
Databricks
 
Ml infra at an early stage
Ml infra at an early stageMl infra at an early stage
Ml infra at an early stage
Nick Handel
 
NextGenML
NextGenML NextGenML
Forget becoming a Data Scientist, become a Machine Learning Engineer instead
Forget becoming a Data Scientist, become a Machine Learning Engineer insteadForget becoming a Data Scientist, become a Machine Learning Engineer instead
Forget becoming a Data Scientist, become a Machine Learning Engineer instead
Data Con LA
 
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
Sri Ambati
 
H2O Machine Learning with KNIME Analytics Platform - Christian Dietz - H2O AI...
H2O Machine Learning with KNIME Analytics Platform - Christian Dietz - H2O AI...H2O Machine Learning with KNIME Analytics Platform - Christian Dietz - H2O AI...
H2O Machine Learning with KNIME Analytics Platform - Christian Dietz - H2O AI...
Sri Ambati
 
Model Experiments Tracking and Registration using MLflow on Databricks
Model Experiments Tracking and Registration using MLflow on DatabricksModel Experiments Tracking and Registration using MLflow on Databricks
Model Experiments Tracking and Registration using MLflow on Databricks
Databricks
 
AI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksAI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with Databricks
Databricks
 
Next.ml Boston: Data Science Dev Ops
Next.ml Boston: Data Science Dev OpsNext.ml Boston: Data Science Dev Ops
Next.ml Boston: Data Science Dev OpsEric Chiang
 
Scaling up Deep Learning by Scaling Down
Scaling up Deep Learning by Scaling DownScaling up Deep Learning by Scaling Down
Scaling up Deep Learning by Scaling Down
Databricks
 
Infrastructure Solutions for Deploying AI/ML/DL Workloads at Scale
Infrastructure Solutions for Deploying AI/ML/DL Workloads at ScaleInfrastructure Solutions for Deploying AI/ML/DL Workloads at Scale
Infrastructure Solutions for Deploying AI/ML/DL Workloads at Scale
Robb Boyd
 

What's hot (20)

Machine Learning at Scale with MLflow and Apache Spark
Machine Learning at Scale with MLflow and Apache SparkMachine Learning at Scale with MLflow and Apache Spark
Machine Learning at Scale with MLflow and Apache Spark
 
MLFlow: Platform for Complete Machine Learning Lifecycle
MLFlow: Platform for Complete Machine Learning Lifecycle MLFlow: Platform for Complete Machine Learning Lifecycle
MLFlow: Platform for Complete Machine Learning Lifecycle
 
Challenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in ProductionChallenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in Production
 
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle ManagementMLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
 
MLOps with Kubeflow
MLOps with Kubeflow MLOps with Kubeflow
MLOps with Kubeflow
 
Msst 2019 v4
Msst 2019 v4Msst 2019 v4
Msst 2019 v4
 
DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
 
Spark Summit EU 2017 - Preventing revenue leakage and monitoring distributed ...
Spark Summit EU 2017 - Preventing revenue leakage and monitoring distributed ...Spark Summit EU 2017 - Preventing revenue leakage and monitoring distributed ...
Spark Summit EU 2017 - Preventing revenue leakage and monitoring distributed ...
 
Real time machine learning
Real time machine learningReal time machine learning
Real time machine learning
 
Managing the Machine Learning Lifecycle with MLflow
Managing the Machine Learning Lifecycle with MLflowManaging the Machine Learning Lifecycle with MLflow
Managing the Machine Learning Lifecycle with MLflow
 
Ml infra at an early stage
Ml infra at an early stageMl infra at an early stage
Ml infra at an early stage
 
NextGenML
NextGenML NextGenML
NextGenML
 
Forget becoming a Data Scientist, become a Machine Learning Engineer instead
Forget becoming a Data Scientist, become a Machine Learning Engineer insteadForget becoming a Data Scientist, become a Machine Learning Engineer instead
Forget becoming a Data Scientist, become a Machine Learning Engineer instead
 
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
 
H2O Machine Learning with KNIME Analytics Platform - Christian Dietz - H2O AI...
H2O Machine Learning with KNIME Analytics Platform - Christian Dietz - H2O AI...H2O Machine Learning with KNIME Analytics Platform - Christian Dietz - H2O AI...
H2O Machine Learning with KNIME Analytics Platform - Christian Dietz - H2O AI...
 
Model Experiments Tracking and Registration using MLflow on Databricks
Model Experiments Tracking and Registration using MLflow on DatabricksModel Experiments Tracking and Registration using MLflow on Databricks
Model Experiments Tracking and Registration using MLflow on Databricks
 
AI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksAI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with Databricks
 
Next.ml Boston: Data Science Dev Ops
Next.ml Boston: Data Science Dev OpsNext.ml Boston: Data Science Dev Ops
Next.ml Boston: Data Science Dev Ops
 
Scaling up Deep Learning by Scaling Down
Scaling up Deep Learning by Scaling DownScaling up Deep Learning by Scaling Down
Scaling up Deep Learning by Scaling Down
 
Infrastructure Solutions for Deploying AI/ML/DL Workloads at Scale
Infrastructure Solutions for Deploying AI/ML/DL Workloads at ScaleInfrastructure Solutions for Deploying AI/ML/DL Workloads at Scale
Infrastructure Solutions for Deploying AI/ML/DL Workloads at Scale
 

Similar to Machine Learning Operations (MLOps) - Active Failures and Latent Conditions

Machine Learning Operations Active Failures, Latent Conditions
Machine Learning Operations Active Failures, Latent ConditionsMachine Learning Operations Active Failures, Latent Conditions
Machine Learning Operations Active Failures, Latent Conditions
Flavio Clesio
 
AI4SE: Challenges and opportunities in the integration of Systems Engineering...
AI4SE: Challenges and opportunities in the integration of Systems Engineering...AI4SE: Challenges and opportunities in the integration of Systems Engineering...
AI4SE: Challenges and opportunities in the integration of Systems Engineering...
CARLOS III UNIVERSITY OF MADRID
 
Machine Learning Risk Management
Machine Learning Risk ManagementMachine Learning Risk Management
Machine Learning Risk Management
Andrew Clark
 
TPS Lean and Agile - Brief History and Future
TPS Lean and Agile - Brief History and FutureTPS Lean and Agile - Brief History and Future
TPS Lean and Agile - Brief History and Future
Kiro Harada
 
Ontonix Complexity Measurement and Predictive Analytics WP Oct 2013
Ontonix Complexity Measurement and Predictive Analytics WP Oct 2013Ontonix Complexity Measurement and Predictive Analytics WP Oct 2013
Ontonix Complexity Measurement and Predictive Analytics WP Oct 2013
Datonix.it
 
Eliminate 7 Mudas
Eliminate 7 MudasEliminate 7 Mudas
Eliminate 7 Mudas
Raja Nagendra Kumar
 
Is increasing entropy of information systems a fatality
Is increasing entropy of information systems a fatalityIs increasing entropy of information systems a fatality
Is increasing entropy of information systems a fatality
René MANDEL
 
KMME 2014 Douglas Weidner
KMME 2014 Douglas WeidnerKMME 2014 Douglas Weidner
KMME 2014 Douglas Weidner
KMMiddleEast
 
The Quality “Logs”-Jam: Why Alerting for Cybersecurity is Awash with False Po...
The Quality “Logs”-Jam: Why Alerting for Cybersecurity is Awash with False Po...The Quality “Logs”-Jam: Why Alerting for Cybersecurity is Awash with False Po...
The Quality “Logs”-Jam: Why Alerting for Cybersecurity is Awash with False Po...
Mark Underwood
 
1-SoftwareEngineeringandBestPractices.ppt
1-SoftwareEngineeringandBestPractices.ppt1-SoftwareEngineeringandBestPractices.ppt
1-SoftwareEngineeringandBestPractices.ppt
MaheshMutnale1
 
Resilience Engineering & Human Error... in IT
Resilience Engineering & Human Error... in ITResilience Engineering & Human Error... in IT
Resilience Engineering & Human Error... in IT
João Miranda
 
INCOSE IS 2019: AI and Systems Engineering
INCOSE IS 2019: AI and Systems EngineeringINCOSE IS 2019: AI and Systems Engineering
INCOSE IS 2019: AI and Systems Engineering
CARLOS III UNIVERSITY OF MADRID
 
421 672 Management Of Technological Enterprises(2008 Tutorial 1)
421 672 Management Of Technological Enterprises(2008   Tutorial 1)421 672 Management Of Technological Enterprises(2008   Tutorial 1)
421 672 Management Of Technological Enterprises(2008 Tutorial 1)William Hall
 
Challenges in the integration of Systems Engineering and the AI/ML model life...
Challenges in the integration of Systems Engineering and the AI/ML model life...Challenges in the integration of Systems Engineering and the AI/ML model life...
Challenges in the integration of Systems Engineering and the AI/ML model life...
CARLOS III UNIVERSITY OF MADRID
 
Synergy of Human and Artificial Intelligence in Software Engineering
Synergy of Human and Artificial Intelligence in Software EngineeringSynergy of Human and Artificial Intelligence in Software Engineering
Synergy of Human and Artificial Intelligence in Software Engineering
Tao Xie
 
World's Most Influential Leaders Inspiring The Tech World, 2024
World's Most Influential Leaders Inspiring The Tech World, 2024World's Most Influential Leaders Inspiring The Tech World, 2024
World's Most Influential Leaders Inspiring The Tech World, 2024
Worlds Leaders Magazine
 
How Did We End up Here?
 How Did We End up Here? How Did We End up Here?
How Did We End up Here?
C4Media
 
BLOCKISH SCAM EXPOSURE USING AUTOMATION LEARNING
BLOCKISH SCAM EXPOSURE USING AUTOMATION LEARNINGBLOCKISH SCAM EXPOSURE USING AUTOMATION LEARNING
BLOCKISH SCAM EXPOSURE USING AUTOMATION LEARNING
IRJET Journal
 
MelissaJarquin_Portfolio extract_compressed
MelissaJarquin_Portfolio extract_compressedMelissaJarquin_Portfolio extract_compressed
MelissaJarquin_Portfolio extract_compressedMelissa Jarquin
 
Koosis on Risk & Innovation
Koosis on Risk & InnovationKoosis on Risk & Innovation
Koosis on Risk & InnovationDavid Koosis
 

Similar to Machine Learning Operations (MLOps) - Active Failures and Latent Conditions (20)

Machine Learning Operations Active Failures, Latent Conditions
Machine Learning Operations Active Failures, Latent ConditionsMachine Learning Operations Active Failures, Latent Conditions
Machine Learning Operations Active Failures, Latent Conditions
 
AI4SE: Challenges and opportunities in the integration of Systems Engineering...
AI4SE: Challenges and opportunities in the integration of Systems Engineering...AI4SE: Challenges and opportunities in the integration of Systems Engineering...
AI4SE: Challenges and opportunities in the integration of Systems Engineering...
 
Machine Learning Risk Management
Machine Learning Risk ManagementMachine Learning Risk Management
Machine Learning Risk Management
 
TPS Lean and Agile - Brief History and Future
TPS Lean and Agile - Brief History and FutureTPS Lean and Agile - Brief History and Future
TPS Lean and Agile - Brief History and Future
 
Ontonix Complexity Measurement and Predictive Analytics WP Oct 2013
Ontonix Complexity Measurement and Predictive Analytics WP Oct 2013Ontonix Complexity Measurement and Predictive Analytics WP Oct 2013
Ontonix Complexity Measurement and Predictive Analytics WP Oct 2013
 
Eliminate 7 Mudas
Eliminate 7 MudasEliminate 7 Mudas
Eliminate 7 Mudas
 
Is increasing entropy of information systems a fatality
Is increasing entropy of information systems a fatalityIs increasing entropy of information systems a fatality
Is increasing entropy of information systems a fatality
 
KMME 2014 Douglas Weidner
KMME 2014 Douglas WeidnerKMME 2014 Douglas Weidner
KMME 2014 Douglas Weidner
 
The Quality “Logs”-Jam: Why Alerting for Cybersecurity is Awash with False Po...
The Quality “Logs”-Jam: Why Alerting for Cybersecurity is Awash with False Po...The Quality “Logs”-Jam: Why Alerting for Cybersecurity is Awash with False Po...
The Quality “Logs”-Jam: Why Alerting for Cybersecurity is Awash with False Po...
 
1-SoftwareEngineeringandBestPractices.ppt
1-SoftwareEngineeringandBestPractices.ppt1-SoftwareEngineeringandBestPractices.ppt
1-SoftwareEngineeringandBestPractices.ppt
 
Resilience Engineering & Human Error... in IT
Resilience Engineering & Human Error... in ITResilience Engineering & Human Error... in IT
Resilience Engineering & Human Error... in IT
 
INCOSE IS 2019: AI and Systems Engineering
INCOSE IS 2019: AI and Systems EngineeringINCOSE IS 2019: AI and Systems Engineering
INCOSE IS 2019: AI and Systems Engineering
 
421 672 Management Of Technological Enterprises(2008 Tutorial 1)
421 672 Management Of Technological Enterprises(2008   Tutorial 1)421 672 Management Of Technological Enterprises(2008   Tutorial 1)
421 672 Management Of Technological Enterprises(2008 Tutorial 1)
 
Challenges in the integration of Systems Engineering and the AI/ML model life...
Challenges in the integration of Systems Engineering and the AI/ML model life...Challenges in the integration of Systems Engineering and the AI/ML model life...
Challenges in the integration of Systems Engineering and the AI/ML model life...
 
Synergy of Human and Artificial Intelligence in Software Engineering
Synergy of Human and Artificial Intelligence in Software EngineeringSynergy of Human and Artificial Intelligence in Software Engineering
Synergy of Human and Artificial Intelligence in Software Engineering
 
World's Most Influential Leaders Inspiring The Tech World, 2024
World's Most Influential Leaders Inspiring The Tech World, 2024World's Most Influential Leaders Inspiring The Tech World, 2024
World's Most Influential Leaders Inspiring The Tech World, 2024
 
How Did We End up Here?
 How Did We End up Here? How Did We End up Here?
How Did We End up Here?
 
BLOCKISH SCAM EXPOSURE USING AUTOMATION LEARNING
BLOCKISH SCAM EXPOSURE USING AUTOMATION LEARNINGBLOCKISH SCAM EXPOSURE USING AUTOMATION LEARNING
BLOCKISH SCAM EXPOSURE USING AUTOMATION LEARNING
 
MelissaJarquin_Portfolio extract_compressed
MelissaJarquin_Portfolio extract_compressedMelissaJarquin_Portfolio extract_compressed
MelissaJarquin_Portfolio extract_compressed
 
Koosis on Risk & Innovation
Koosis on Risk & InnovationKoosis on Risk & Innovation
Koosis on Risk & Innovation
 

More from Flavio Clesio

Security in Machine Learning
Security in Machine LearningSecurity in Machine Learning
Security in Machine Learning
Flavio Clesio
 
Apache Spark: Casos de uso e escalabilidade
Apache Spark: Casos de uso e escalabilidadeApache Spark: Casos de uso e escalabilidade
Apache Spark: Casos de uso e escalabilidade
Flavio Clesio
 
SP Big Data Meetup - March/16
SP Big Data Meetup - March/16SP Big Data Meetup - March/16
SP Big Data Meetup - March/16
Flavio Clesio
 
Loren seagrave neuro biomechanics of maximum velocity sprinting
Loren seagrave neuro biomechanics of maximum velocity sprintingLoren seagrave neuro biomechanics of maximum velocity sprinting
Loren seagrave neuro biomechanics of maximum velocity sprintingFlavio Clesio
 
Tom tellez sprinting a biomechanical approach
Tom tellez sprinting   a biomechanical approachTom tellez sprinting   a biomechanical approach
Tom tellez sprinting a biomechanical approachFlavio Clesio
 
Dan pfaff - guidelines for plyometric training
Dan pfaff - guidelines for plyometric trainingDan pfaff - guidelines for plyometric training
Dan pfaff - guidelines for plyometric training
Flavio Clesio
 
Mini Atletismo
Mini AtletismoMini Atletismo
Mini Atletismo
Flavio Clesio
 
Planilha De Treinos - 100 Metros
Planilha De Treinos - 100 MetrosPlanilha De Treinos - 100 Metros
Planilha De Treinos - 100 Metros
Flavio Clesio
 

More from Flavio Clesio (8)

Security in Machine Learning
Security in Machine LearningSecurity in Machine Learning
Security in Machine Learning
 
Apache Spark: Casos de uso e escalabilidade
Apache Spark: Casos de uso e escalabilidadeApache Spark: Casos de uso e escalabilidade
Apache Spark: Casos de uso e escalabilidade
 
SP Big Data Meetup - March/16
SP Big Data Meetup - March/16SP Big Data Meetup - March/16
SP Big Data Meetup - March/16
 
Loren seagrave neuro biomechanics of maximum velocity sprinting
Loren seagrave neuro biomechanics of maximum velocity sprintingLoren seagrave neuro biomechanics of maximum velocity sprinting
Loren seagrave neuro biomechanics of maximum velocity sprinting
 
Tom tellez sprinting a biomechanical approach
Tom tellez sprinting   a biomechanical approachTom tellez sprinting   a biomechanical approach
Tom tellez sprinting a biomechanical approach
 
Dan pfaff - guidelines for plyometric training
Dan pfaff - guidelines for plyometric trainingDan pfaff - guidelines for plyometric training
Dan pfaff - guidelines for plyometric training
 
Mini Atletismo
Mini AtletismoMini Atletismo
Mini Atletismo
 
Planilha De Treinos - 100 Metros
Planilha De Treinos - 100 MetrosPlanilha De Treinos - 100 Metros
Planilha De Treinos - 100 Metros
 

Machine Learning Operations (MLOps) - Active Failures and Latent Conditions

  • 1. Flávio Clésio April, 2020 Machine Learning Operations Active Failures, Latent Conditions MLOps Community
  • 2. ABOUT ME Flávio Clésio • Machine Learning Engineer @ MyHammer AG • MSc. in Production Engineering (Machine Learning in Credit Derivatives/NPL) • Specialist in Database Engineering and Business Intelligence • Blogger @ flavioclesio.com • Has spoken at several venues (Strata Hadoop World, Spark Summit, PAPIS.io, The Developers Conference and so on…) flavioclesio
  • 3. CURRENT STATE OF ML SYSTEMS Machine Learning Systems play a huge role in several businesses, from banking and recommender systems to health domains. When we talk about high-stakes Machine Learning in production, we can consider that the era of "a-data-scientist-with-a-script-in-a-single-machine" is officially over.
  • 4. WHAT IS IT ABOUT? This talk discusses risk assessment in ML Systems from the perspective of reliability, operations and especially the causal aspects that can lead to outages in ML Systems.
  • 5. SURVIVORSHIP BIAS Most posts, conference talks and papers present only what worked extremely well, how those solutions generated revenue for the company, and other happy cases.
  • 6. SURVIVORSHIP BIAS Almost no one discloses what went wrong during the development of these solutions. This is a problem because we only see the final outcome, not how that outcome was generated or the failures and errors made along the way.
  • 7. FAILURE: A NOT SO ROMANTIC TOPIC ● Not sexy ● People can feel blamed or silly when talking about their errors ● Can turn into "bad personal/corporate branding" ● Those who died during the process cannot tell what went wrong
  • 9. SOME SPECIFIC FAILURE CASES ● Amazon: The data on a load balancer was deleted, and that caused a disruption in practically an entire AWS region at the time; ● GitLab: The deletion of a production database led to 18 hours of unavailability with loss of customer data; ● Knight Capital: The lack of a code review culture allowed an engineer to deploy code that was 8 years out of date to production. Outcome: losses of $ 172,222 per second for 45 minutes (or US$ 465 million). ● European Space Agency: A conversion from a 64-bit floating-point number to a 16-bit signed integer caused an overflow in the rocket steering system, triggering a chain of events that destroyed the rocket and caused a loss of more than $ 370 million ● NASA: A degradation from an engineering culture to a political/product culture led to a catastrophic failure that not only cost billions of dollars but also killed the crew of the Challenger space shuttle.
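The European Space Agency case above belongs to a general class of bug: a wide value is narrowed to a 16-bit signed integer without a range check. The following is a minimal Python sketch of that class of bug only; it does not reproduce the actual flight software or how it reacted to the overflow. The function names `to_int16_unchecked` and `to_int16_checked` are hypothetical and exist only for this illustration.

```python
# Illustrative sketch only: narrowing a 64-bit float to a 16-bit signed integer.
# Function names are hypothetical; this is not the flight software.

INT16_MIN = -(2 ** 15)     # -32768
INT16_MAX = 2 ** 15 - 1    #  32767


def to_int16_unchecked(value: float) -> int:
    """Truncate toward zero and keep only the low 16 bits (two's complement)."""
    raw = int(value) & 0xFFFF
    # Reinterpret the 16-bit pattern as a signed number.
    return raw - 0x10000 if raw >= 0x8000 else raw


def to_int16_checked(value: float) -> int:
    """Truncate toward zero, but refuse values that do not fit in 16 bits."""
    truncated = int(value)
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a 16-bit signed integer")
    return truncated


if __name__ == "__main__":
    small, large = 123.9, 40000.7
    print(to_int16_unchecked(small), to_int16_checked(small))
    print(to_int16_unchecked(large))  # silently wraps to a negative number
    try:
        to_int16_checked(large)
    except OverflowError as exc:
        print("rejected:", exc)
```

The checked variant turns a silent corruption into an explicit error, which is only a defense if the caller actually handles that error.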
  • 10. FAILURE = LEARNING = OPPORTUNITY TO IMPROVE ● There is always a lesson to be learned from what went wrong ● A good culture is not about blaming people or avoiding the problems, but about analysing them, learning and improving. ● With every lesson learned, the whole system becomes more reliable
  • 11. RELIABILITY BENCHMARK The aviation industry is one example where reliability increases significantly after every incident/accident. It is one of the industries that becomes more reliable even as the number of flights grows over time, and because of that the number of fatalities has been falling year by year.
  • 12. SWISS CHEESE MODEL This model was created by James Reason in the early 1990s as a general framework for understanding the dynamics of accident causation. The idea is to identify latent conditions and active failures in order to put countermeasures in place that minimize unwanted variability in human behaviour in socio-technological systems. Sources: "Human error: models and management" and "The contribution of latent human failures to the breakdown of complex systems"
  • 13. DEFENSES, BARRIERS, SAFEGUARDS [...] High-tech systems have many defensive layers: some are designed (alarms, physical barriers, automatic shutdowns, etc.), others rely on people (surgeons, anesthetists, pilots, control room operators, etc.) and others rely on procedures and administrative controls. [...] Source: Human error: models and management
  • 14. DEFENSES, BARRIERS, SAFEGUARDS [...] In an ideal world, each defensive layer would be intact. In reality, however, they are more like slices of Swiss cheese, with many holes - although, unlike cheese, these holes are continually opening, closing and moving. The presence of holes in any "slice" does not normally cause a bad result. This can usually happen only when the holes in several layers line up momentarily to allow for an accident opportunity trajectory - bringing risks to harmful contact with victims[...] Source: Human error: models and management
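The idea that harm requires holes in several layers to line up can be made concrete with a small calculation. The sketch below is a simplification, not part of Reason's model: it assumes each layer fails independently with a fixed probability, whereas the quoted text stresses that holes open, close and move. The layer names and probabilities are hypothetical.

```python
# Illustrative sketch: chance that a hazard passes every defensive layer,
# ASSUMING the layers fail independently (a simplification).
from math import prod
from typing import Dict


def breach_probability(hole_probs: Dict[str, float]) -> float:
    """Multiply the per-layer probabilities of a hole being open."""
    for layer, p in hole_probs.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability for {layer!r} must be in [0, 1]")
    return prod(hole_probs.values())


if __name__ == "__main__":
    # Hypothetical numbers chosen only for the example.
    layers = {"code review": 0.10, "unit tests": 0.20, "monitoring": 0.05}
    print(breach_probability(layers))        # about 0.001

    # Removing a layer is the same as a hole that is always open.
    no_review = {**layers, "code review": 1.0}
    print(breach_probability(no_review))     # about 0.01
```

Under this assumption, dropping a single layer raises the chance of a breach by an order of magnitude in the example. Latent conditions can make layers fail together, in which case the real risk is higher than the product suggests.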
  • 15. SWISS CHEESE MODEL Source: Understanding models of error and how they apply in clinical practice
  • 16. WHY NOT JUST FIX THE PROBLEM AND MOVE ON? ● Local fixes are appealing because they sound productive and look good to other teams (e.g. "see how fast this person solved the problem?") ● They cultivate dormant problems and silent risks that can potentially cause harm, instead of eliminating them ● They promote a stagnant engineering culture instead of aiming for continuous reform (i.e. not only cosmetic enhancements but substantial reform) ● Local fixes are like having a problem with mosquitoes and swatting them every day, instead of solving the problem by draining the swamps in which they breed.
  • 17. LATENT CONDITIONS AND ACTIVE FAILURES In this case, each slice of Swiss cheese would be a line of defense made of designed layers (e.g. monitoring, alarms, code push locks in production) and/or procedural layers that involve people (e.g. cultural aspects, training and qualification of committers in the repository, rollback mechanisms, unit and integration tests).
  • 18. LATENT CONDITIONS [...] Latent conditions are situations intrinsically resident within the system, which are consequences of design, engineering decisions, whoever wrote the rules or procedures, and even the highest hierarchical levels in an organization. These latent conditions can lead to two types of adverse effects: situations that provoke errors and the creation of vulnerabilities. That is, the solution has a design that increases the likelihood of high negative impact events, which can act as a causal or contributing factor.[...] Source: Human error: models and management
  • 19. ACTIVE FAILURES [...]Active failures are unsafe acts or minor transgressions committed by people who are in direct contact with the system; these acts can be mistakes, lapses, distortions, omissions, errors and procedural violations.[...] Source: Human error: models and management
  • 20. HUMAN FACTORS Source: Human Factors Analysis and Classification System (HFACS)
  • 21. SOME LATENT CONDITIONS IN ML ● Absence of a code review culture (e.g. London Whale and Knight Capital) ● Culture of improvised technical arrangements (e.g. workarounds) ● Lack of observability ● Decisions made by majority vote among less-informed people rather than by consensus between experts and risk-takers
  • 22. SOME ACTIVE FAILURES IN ML ● Résumé-Driven Development ● Unreviewed code going into production ● Data leakage in model training ● Lack of reproducibility / replicability ● Glue code
  • 23. A SWISS CHEESE OF AN OUTAGE IN ML SYSTEM
  • 24. SOME STRATEGIES TO MINIMIZE RISK SURFACE ● Post-Mortems ● Automation ● Monitoring & Alerts ○ Observability ■ Metrics ■ Logs ■ Application Performance Monitoring ● Continuous reform and permanent assessment
  • 25. SOME TOOLS ● Orchestration (e.g. Mesos, Airflow, Kubernetes, AWS ECS, Kubeflow) ● Observability (e.g. Elasticsearch, Kibana, Prometheus, Sentry, Grafana, FluentBit, Datadog) ● ML Experiment Management (e.g. ModelChimp, Randopt, Forge, Lore, Datmo, Studio ML, Sacred, MLflow, Polyaxon) ● Data Versioning and management (e.g. DVC, Pachyderm, Snorkel) ● ML SaaS (e.g. Algorithmia, Peltarion, Databricks, Seldon IO, Google AI Platform, AWS SageMaker, Azure ML Studio, Dotscience, Dataiku DSS, Domino AI, Polyaxon, Weights & Biases, Spell, Gradient, Paperspace, H2O AI, Stack ML, Comet, Valohai, Neptune AI)
  • 26. FINAL REMARKS • There is no silver bullet for risk management in ML platforms. The hard part is knowing how to perceive and manage those risks • An outage never happens for a single reason. An outage occurs when several latent conditions and active failures combine and align to trigger the event • Human factors play a huge role in outages • If possible, share your mistakes. When someone shares what went wrong, everyone learns and all ML systems become more robust.
  • 27. THANK YOU! @flavioclesio fclesio flavioclesio flavioclesio.com
  • 28. REFERENCES ● Reason, James. "The contribution of latent human failures to the breakdown of complex systems." Philosophical Transactions of the Royal Society of London. B, Biological Sciences 327.1241 (1990): 475-484. ● Reason, J. "Human error: models and management." BMJ (Clinical research ed.) vol. 320,7237 (2000): 768-70. doi:10.1136/bmj.320.7237.768 ● Morgenthaler, J. David, et al. "Searching for build debt: Experiences managing technical debt at Google." 2012 Third International Workshop on Managing Technical Debt (MTD). IEEE, 2012. ● Alahdab, Mohannad, and Gül Çalıklı. "Empirical Analysis of Hidden Technical Debt Patterns in Machine Learning Software." International Conference on Product-Focused Software Process Improvement. Springer, Cham, 2019.
  • 29. REFERENCES ● Perneger, Thomas V. "The Swiss cheese model of safety incidents: are there holes in the metaphor?" BMC Health Services Research vol. 5 71. 9 Nov. 2005, doi:10.1186/1472-6963-5-71 ● "Hot cheese: a processed Swiss cheese model." J R Coll Physicians Edinb 44 (2014): 116-21. ● Breck, Eric, et al. "What's your ML Test Score? A rubric for ML production systems." (2016). ● SEC Charges Knight Capital With Violations of Market Access Rule ● Machine Learning Goes Production! Engineering, Maintenance Cost, Technical Debt, Applied Data Analysis Lab Seminar
  • 30. REFERENCES ● Nassim Taleb – Lectures on Fat Tails, (Anti)Fragility, Precaution, and Asymmetric Exposures ● Skybrary – Human Factors Analysis and Classification System (HFACS) ● CEFA Aviation – Swiss Cheese Model ● A List of Post-mortems ● Richard Cook – How Complex Systems Fail ● Airbus – Hull Losses ● Number of flights performed by the global airline industry from 2004 to 2020