SlideShare a Scribd company logo
Copyright Third Nature, Inc.
Executive Briefing:
The black box –
interpretability,
reproducibility, and
data management
O’Reilly Artificial Intelligence
conference
London
October, 2019
Mark Madsen
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
What does management care about?
The perspective and problems
of the person responsible for
oversight of the results is
spread across the organization
and across multiple projects
▪ The owner of a decision or the
outcome of the decision. Aka the
process owner
▪ The CAO, CDO, VP of analytics,
aka “your boss” if you’re a data
scientist.
Copyright Third Nature, Inc.
Repeatability
Copyright Third Nature, Inc.
Operational predictability
Reproducibility
Can you support
the answer at a
later date?
Can you
replicate the
process used to
get the answer?
Can you get
similar results
on the same or
similar input?
Copyright Third Nature, Inc.
Scientific
reproducibility
Can you get the same
results given the same
starting conditions?
▪ Duplicate as much as
possible the experiment.
▪ Results of experimental
replication may not be the
same for many reasons
▪ Detailed statistical analysis
is needed
▪ And critical assessment of
experiments
Copyright Third Nature, Inc.
Reproducibility in
data science
Can you get the same
results on the same
data?
▪ Direct replication is
expected
▪ Assumption is that
same input = same
output
▪ You can have this with
an unexplainable box
▪ But there are also
confounding factors
Copyright Third Nature, Inc.
Moving from predictable rule-based systems to complex
mathematical systems, and from there to systems that
exhibit stochasticity, makes the task harder.
One thing worse than
a black box is a
random black box.
Copyright Third Nature, Inc.
If you trust the box then it’s ok not to understand it.
Without reproducibility there cannot be trust.
Trust comes from it working
It works = reliability, working over time
Reliability of your solution is a systems problem, not a
just a model problem (in the GST sense of system)
Therefore, if there is trust, or there is low risk, you
may not need to worry about reproducibility
Copyright Third Nature, Inc.
Interpretability and reproducibility are driven by trust,
which only matters when there is enough risk
Regulation and compliance
Material decisions
Big penalties
Complication: the cost of false
positives and false negatives
are usually different, so model
characteristics matter here.
Cost of error
Frequency of
decision
Low High
High
Don’t care
Oh crap
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
Beyond toy examples, this problem matters
Edge case error problems will limit
AI applicability until they are solved
(don’t hold your breath).
The uncertainty challenge means:
▪ Calculate error costs and apply them
to your model before (and after)
▪ The more risk averse, the more
predictability is desire, the less
variance one can tolerate
Copyright Third Nature, Inc.
The real questions
Can I support this result at a later date?
▪ Do I care?
• Is the risk (cost) worth worrying about?
• Is the cost of reproducibility less than the risk and cost?
▪ Do I only need to justify the decision?
• interpretability and trust may be enough
• Unless there are audits or legal discovery involved
▪ Do I need to reproduce the results?
• May not need interpretability
• Need a lot of other things
Copyright Third Nature, Inc.
The real need is trust. Our trust is
based on all the elements that
are involved, not just the model.
The higher the stakes the more
you must think about all the ways
it could go wrong.
Reliability and robustness of the
technology environment is as
important as the model.
Reproducibility is not just the
data scientist’s problem – it is an
operational concern.
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
Three categories of use, with differing trust levels
Embedded
Human in
the loop
Ad hoc
Copyright Third Nature, Inc.
Explainability only addresses the functioning of the model
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
Three categories of use, with differing trust levels
Embedded
Human in
the loop
Ad hoc
In general, explainability is
more useful when there is
human agency and control in
decisions.
Reproducibility is increasingly
important where the machine
has agency because the
impact of automation expands
the scope and increases the
complexity beyond the model.
Machine learning is the smallest part of the environment
ML
Code
Analysis Tools
Data
Collection
Machine
Resource
Management
Serving
Infrastructure
Feature
Extraction
Configuration
Data
Verification
Process
Management
Tools
Monitoring
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems
©2018 Teradata
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
The end result in the data science landscape is complexity
You can explain the model, but can you explain this?
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
ML Principle: CACE, Change Anything Change Everything
In embedded or
autonomous ML
everything is
connected.
Events usually
happen in real time.
ML is very sensitive
to context and input.
“So what if I
changed NumPy
in dev?”
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
Working backwards from the decision or answer,
what do you need to have reproducibiity?
ProcessSource Run Env
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
Fine for a single-run model. What if it’s
embedded in an application somewhere?
ProcessSource Run Env
App EnvProcess
Deploy
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
If a model is used over time, you will likely
change it. Or sources, or processes will change.
ProcessSource Run Env
App EnvProcess
Deploy
Run Env
Run Env
Process
Process
Run Env
Process Run Env
Process
Copyright Third Nature, Inc.
We’re so focused on the light switch that
we’re not talking about the light.
Copyright Third Nature, Inc.
Most emphasis in the
industry is on code and
code artifacts:
▪ Model repositories
▪ Model management
▪ Pipeline frameworks
▪ Packaging
▪ Versioning
▪ Tools
Why? Because vendors want
to sell you products for the
problem they helped create.
Managing the code
without managing
the data?
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
TIDY DATA: Hadley Wickham makes the case for
Tidy data sets, that have specific structure, are easy to
work with, that free analysts from mundane data
manipulation chores – there’s no need to start from
scratch and reinvent new methods for data cleaning
Source: Tidy Data by Hadley Wickham, Journal of Statistical Software, Vol 59, issue 10 (2014)
https://vita.had.co.nz/papers/tidy-data.html
What do the experts say?
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
Andrew Ng
Leading AI Researcher
“If your boss asks you, tell
them that I said build a unified
data warehouse”
What about AI? GIGO!
Everyone wants shortcuts. There’s aren’t any shortcuts
Copyright Third Nature, Inc.
Reproducibility? No problem. Just save versions of data
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
Today’s market solution:
the Data Lake*
*Depiction for illustrative
purposes only, no warranty
express or implied, actual
system may vary.
By a lot.
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
Today’s market solution:
the Data Lake + DIY
Data hoarding is not a data
management strategy
Copyright Third Nature, Inc.
A standard data science approach (to avoid)
One-Pipeline-
Per-Process
Redundant Effort /
Cost / Complexity /
etc.
WELL
HEAD
WELL
HEAD
Example use cases
Copyright Third Nature, Inc.
A dystopian model: Lake + data engineers + pipelines
The Lake with pipelines to files or tables for each model
is exactly the same pattern as mainframe COBOL batch
Data pipeline
code
We already know that people don’t scale. Don’t do this
Copyright Third Nature, Inc.
Self-service tradeoffs
Pay now or pay later,
but you will always pay.
The question is which
payment will be less.
Self-service gives
flexibility and agility, but
can reduce repeatability
and add duplicated
effort and conflicts, and
an increase in risk.
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
But that self-serve bit only applies to building models.
Embedding them in the enterprise is more involved
ProcessSource Run Env
App EnvProcess
Deploy
Run Env
Run Env
Process
Process
Run Env
Process Run Env
Process
Managing the data
on an ongoing
basis is a complex
problem to solve
Copyright Third Nature, Inc.
The organization-wide focus needs to be on
repeatability – where it can be supported
Copyright Third Nature, Inc.
Buildings above: flexibility, repurposing,
faster change above, funded independently
Applications
Utilities below: stability, reuse, slow
predictable change below, funded centrally
Infrastructure
The infrastructure is a hidden combination
of technology, process, and methods.
There is a need to draw boundaries for
responsibilities in the organization. It can’t
be all business or all IT.
Separate the uses from the infrastructure
– focus on the city, not the buildings
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
2855
Out of Deviation
Sensor Readings
Fraud
Events
CUSTOMER
EXPERIENCE
SALES
MARKETING
OPERATIONS
Abandoned
Carts
Online Price
Quotes
Social Media
Influencers
FINANCE
11234
Sensor data
formatting
Unit
Normalization
Vehicle/version
normalization
Make/model/year
selection
Data pipelines, data architecture, and curation
Copyright Third Nature, Inc.
Data curation is not data modeling
The problem with so many
sources, types, formats and
latencies of data is that it is now
impossible to create in advance
one model for all of it.
Data modeling is about the inside
of a dataset. Curation is about
the entire dataset.
It’s about: creating, labeling,
organizing, finding, navigating,
retiring.
Data curation, rather than data
modeling, is becoming the most
important data management
practice.
Copyright Third Nature, Inc.
Data can be maintained at multiple levels: not raw or DW
Ingredients
Goal: available
User needs a recipe
in order to make
use of the data.
Pre-mixed
Goal: discoverable
and integrateable
User needs a menu
to choose from the
data available
Meals
Goal: usable
User needs utensils
but is given a
finished meal
Copyright Third Nature, Inc.
Data architecture dictates technology and process
Collection
Capture metadata
(including request)
Record the structure
Apply keys
PII masking, restrict
Start lineage
Distribution
Common structures
RDM and MDM
Subject models
Make data findable
and linkable
Data provisioning
Consumption
Model for target use
Quality rules
Track provenance
Apply SLAs
As few engines as
possible – not fewer
Decide on policies for when to place data in which area
Copyright Third Nature, Inc.Copyright Third Nature, Inc.
ML has a lot of data requirements: no shortcuts
https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
Copyright Third Nature, Inc.
Manage your data
(or it will manage you)
Data is the foundation of the
models, and the results. Focus on
managing data as well as the
code and tool artifacts.
Reproducibility, and data science
in general, requires that you
treat the operating and the build
environments as a whole.
Data science in the long term is
not about individual projects, but
organizational capability. This
means infrastructure, process,
governance.
Copyright Third Nature, Inc.
Things you can do
1. Determine what level of risk is acceptable based on the costs of
false positives and false negatives, or poor accuracy.
2. Manage the data history such that you can always reconstruct it.
3. Think through the dependencies outside the model that affect
reproducibility.
4. Track the configuration of everything in the chain of
dependencies.
5. Do not let a thousand flowers bloom with BYOT. You will have to
cut many of them due to complexity.
6. Do not let IT overcomplicate your environment with technology
because of belief things must be big. Brittle infrastructure can’t
be tolerated with embedded machine learning.
7. Make a rational decision about how much effort to expend.
Mark Madsen is a Fellow at Teradata in
the Technology and Innovation Office.
he focused on the data and analytics
ecosystem, problems of large-scale
design, and R&D.
Prior to that he was president of Third
Nature, a research and consulting firm
focused on analytics, integration and
data management.
Mark is an award-winning author,
architect and CTO whose work has
been featured in numerous industry
publications. He received awards for
his work from the American
Productivity & Quality Center and the
Smithsonian Institute. He is an
international speaker, chairs several
conferences and program committees.
Mark Madsen

More Related Content

What's hot

Assumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slidesAssumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slides
mark madsen
 
Strata Data Conference 2019 : Scaling Visualization for Big Data in the Cloud
Strata Data Conference 2019 : Scaling Visualization for Big Data in the CloudStrata Data Conference 2019 : Scaling Visualization for Big Data in the Cloud
Strata Data Conference 2019 : Scaling Visualization for Big Data in the Cloud
Jaipaul Agonus
 
How to Build Data Science Teams
How to Build Data Science TeamsHow to Build Data Science Teams
How to Build Data Science Teams
Ganes Kesari
 
Wake up and smell the data
Wake up and smell the dataWake up and smell the data
Wake up and smell the data
mark madsen
 
Analytics 3.0 Measurable business impact from analytics & big data
Analytics 3.0 Measurable business impact from analytics & big dataAnalytics 3.0 Measurable business impact from analytics & big data
Analytics 3.0 Measurable business impact from analytics & big data
Microsoft
 
Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field
Domino Data Lab
 
Machine Learning in Big Data
Machine Learning in Big DataMachine Learning in Big Data
Machine Learning in Big Data
DataWorks Summit/Hadoop Summit
 
Everything Has Changed Except Us: Modernizing the Data Warehouse
Everything Has Changed Except Us: Modernizing the Data WarehouseEverything Has Changed Except Us: Modernizing the Data Warehouse
Everything Has Changed Except Us: Modernizing the Data Warehouse
mark madsen
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data Scientists
Sri Ambati
 
Building Data Science Teams
Building Data Science TeamsBuilding Data Science Teams
Building Data Science Teams
EMC
 
Disruptive Innovation: how do you use these theories to manage your IT?
Disruptive Innovation: how do you use these theories to manage your IT?Disruptive Innovation: how do you use these theories to manage your IT?
Disruptive Innovation: how do you use these theories to manage your IT?
mark madsen
 
O'Reilly ebook: Machine Learning at Enterprise Scale | Qubole
O'Reilly ebook: Machine Learning at Enterprise Scale | QuboleO'Reilly ebook: Machine Learning at Enterprise Scale | Qubole
O'Reilly ebook: Machine Learning at Enterprise Scale | Qubole
Vasu S
 
Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)
mark madsen
 
What data scientists really do, according to 50 data scientists
What data scientists really do, according to 50 data scientistsWhat data scientists really do, according to 50 data scientists
What data scientists really do, according to 50 data scientists
Hugo Bowne-Anderson
 
Big dataplatform operationalstrategy
Big dataplatform operationalstrategyBig dataplatform operationalstrategy
Big dataplatform operationalstrategy
Himanshu Bari
 
1. introduction to data science —
1. introduction to data science —1. introduction to data science —
1. introduction to data science —
swethaT16
 
Embracing data science
Embracing data scienceEmbracing data science
Embracing data science
Vipul Kalamkar
 
A data view of the data science process
A data view of the data science processA data view of the data science process
A data view of the data science process
Mathieu d'Aquin
 
Everything has changed except us
Everything has changed except usEverything has changed except us
Everything has changed except us
mark madsen
 
Building Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball ApproachBuilding Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball Approach
joshwills
 

What's hot (20)

Assumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slidesAssumptions about Data and Analysis: Briefing room webcast slides
Assumptions about Data and Analysis: Briefing room webcast slides
 
Strata Data Conference 2019 : Scaling Visualization for Big Data in the Cloud
Strata Data Conference 2019 : Scaling Visualization for Big Data in the CloudStrata Data Conference 2019 : Scaling Visualization for Big Data in the Cloud
Strata Data Conference 2019 : Scaling Visualization for Big Data in the Cloud
 
How to Build Data Science Teams
How to Build Data Science TeamsHow to Build Data Science Teams
How to Build Data Science Teams
 
Wake up and smell the data
Wake up and smell the dataWake up and smell the data
Wake up and smell the data
 
Analytics 3.0 Measurable business impact from analytics & big data
Analytics 3.0 Measurable business impact from analytics & big dataAnalytics 3.0 Measurable business impact from analytics & big data
Analytics 3.0 Measurable business impact from analytics & big data
 
Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field
 
Machine Learning in Big Data
Machine Learning in Big DataMachine Learning in Big Data
Machine Learning in Big Data
 
Everything Has Changed Except Us: Modernizing the Data Warehouse
Everything Has Changed Except Us: Modernizing the Data WarehouseEverything Has Changed Except Us: Modernizing the Data Warehouse
Everything Has Changed Except Us: Modernizing the Data Warehouse
 
Intro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data ScientistsIntro to Data Science for Non-Data Scientists
Intro to Data Science for Non-Data Scientists
 
Building Data Science Teams
Building Data Science TeamsBuilding Data Science Teams
Building Data Science Teams
 
Disruptive Innovation: how do you use these theories to manage your IT?
Disruptive Innovation: how do you use these theories to manage your IT?Disruptive Innovation: how do you use these theories to manage your IT?
Disruptive Innovation: how do you use these theories to manage your IT?
 
O'Reilly ebook: Machine Learning at Enterprise Scale | Qubole
O'Reilly ebook: Machine Learning at Enterprise Scale | QuboleO'Reilly ebook: Machine Learning at Enterprise Scale | Qubole
O'Reilly ebook: Machine Learning at Enterprise Scale | Qubole
 
Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)Bi isn't big data and big data isn't BI (updated)
Bi isn't big data and big data isn't BI (updated)
 
What data scientists really do, according to 50 data scientists
What data scientists really do, according to 50 data scientistsWhat data scientists really do, according to 50 data scientists
What data scientists really do, according to 50 data scientists
 
Big dataplatform operationalstrategy
Big dataplatform operationalstrategyBig dataplatform operationalstrategy
Big dataplatform operationalstrategy
 
1. introduction to data science —
1. introduction to data science —1. introduction to data science —
1. introduction to data science —
 
Embracing data science
Embracing data scienceEmbracing data science
Embracing data science
 
A data view of the data science process
A data view of the data science processA data view of the data science process
A data view of the data science process
 
Everything has changed except us
Everything has changed except usEverything has changed except us
Everything has changed except us
 
Building Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball ApproachBuilding Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball Approach
 

Similar to The Black Box: Interpretability, Reproducibility, and Data Management

Putting data science in your business a first utility feedback
Putting data science in your business a first utility feedbackPutting data science in your business a first utility feedback
Putting data science in your business a first utility feedback
Peculium Crypto
 
Intro to ai application emeritus uob-final
Intro to ai application emeritus uob-finalIntro to ai application emeritus uob-final
Intro to ai application emeritus uob-final
Luis Fernando Gonzalez Sanchez
 
Mission Critical Use Cases Show How Analytics Architectures Usher in an Artif...
Mission Critical Use Cases Show How Analytics Architectures Usher in an Artif...Mission Critical Use Cases Show How Analytics Architectures Usher in an Artif...
Mission Critical Use Cases Show How Analytics Architectures Usher in an Artif...
Dana Gardner
 
SDD2017 - 03 Abed Ajraou - putting data science in your business a first uti...
SDD2017 - 03 Abed Ajraou  - putting data science in your business a first uti...SDD2017 - 03 Abed Ajraou  - putting data science in your business a first uti...
SDD2017 - 03 Abed Ajraou - putting data science in your business a first uti...
Dario Mangano
 
Questions On Technical Design Decisions
Questions On Technical Design DecisionsQuestions On Technical Design Decisions
Questions On Technical Design Decisions
Rikki Wright
 
Data Maturity for Nonprofits: Three Perspectives, Nine Lessons, and Three Ass...
Data Maturity for Nonprofits: Three Perspectives, Nine Lessons, and Three Ass...Data Maturity for Nonprofits: Three Perspectives, Nine Lessons, and Three Ass...
Data Maturity for Nonprofits: Three Perspectives, Nine Lessons, and Three Ass...
Karen Graham
 
BDW17 London - Abed Ajraou - First Utility - Putting Data Science in your Bus...
BDW17 London - Abed Ajraou - First Utility - Putting Data Science in your Bus...BDW17 London - Abed Ajraou - First Utility - Putting Data Science in your Bus...
BDW17 London - Abed Ajraou - First Utility - Putting Data Science in your Bus...
Big Data Week
 
Industrializing Data Science: Transform into an End-to-End, Analytics-Oriente...
Industrializing Data Science: Transform into an End-to-End, Analytics-Oriente...Industrializing Data Science: Transform into an End-to-End, Analytics-Oriente...
Industrializing Data Science: Transform into an End-to-End, Analytics-Oriente...
Dana Gardner
 
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
Goodbuzz Inc.
 
ML Times: Mainframe Machine Learning Initiative- June newsletter (2018)
ML Times: Mainframe Machine Learning Initiative- June newsletter (2018)ML Times: Mainframe Machine Learning Initiative- June newsletter (2018)
ML Times: Mainframe Machine Learning Initiative- June newsletter (2018)
Leslie McFarlin
 
Better the devil you know
Better the devil you knowBetter the devil you know
Better the devil you know
Alexandra Deschamps-Sonsino
 
Intel Faster Risk Oct08 - Andrew Parry
Intel Faster Risk Oct08 - Andrew ParryIntel Faster Risk Oct08 - Andrew Parry
Intel Faster Risk Oct08 - Andrew Parry
mikeohara
 
A strategy for security data analytics - SIRACon 2016
A strategy for security data analytics - SIRACon 2016A strategy for security data analytics - SIRACon 2016
A strategy for security data analytics - SIRACon 2016
Jon Hawes
 
Machine Learning: Addressing the Disillusionment to Bring Actual Business Ben...
Machine Learning: Addressing the Disillusionment to Bring Actual Business Ben...Machine Learning: Addressing the Disillusionment to Bring Actual Business Ben...
Machine Learning: Addressing the Disillusionment to Bring Actual Business Ben...
Jon Mead
 
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...
Amazon Web Services Korea
 
Analytics Trends 20145 - Deloitte - us-da-analytics-analytics-trends-2015
Analytics Trends 20145 -  Deloitte - us-da-analytics-analytics-trends-2015Analytics Trends 20145 -  Deloitte - us-da-analytics-analytics-trends-2015
Analytics Trends 20145 - Deloitte - us-da-analytics-analytics-trends-2015
Edgar Alejandro Villegas
 
Demystifying ML/AI
Demystifying ML/AIDemystifying ML/AI
Demystifying ML/AI
Matthew Reynolds
 
AI in healthcare
AI in healthcareAI in healthcare
AI in healthcare
Chinmay Patel
 
5 Things You Should Know About Data Products
5 Things You Should Know About Data Products5 Things You Should Know About Data Products
5 Things You Should Know About Data Products
Scribble Data
 
Focus on Data, Risk Control, and Predictive Analysis Drives New Era of Cloud-...
Focus on Data, Risk Control, and Predictive Analysis Drives New Era of Cloud-...Focus on Data, Risk Control, and Predictive Analysis Drives New Era of Cloud-...
Focus on Data, Risk Control, and Predictive Analysis Drives New Era of Cloud-...
Dana Gardner
 

Similar to The Black Box: Interpretability, Reproducibility, and Data Management (20)

Putting data science in your business a first utility feedback
Putting data science in your business a first utility feedbackPutting data science in your business a first utility feedback
Putting data science in your business a first utility feedback
 
Intro to ai application emeritus uob-final
Intro to ai application emeritus uob-finalIntro to ai application emeritus uob-final
Intro to ai application emeritus uob-final
 
Mission Critical Use Cases Show How Analytics Architectures Usher in an Artif...
Mission Critical Use Cases Show How Analytics Architectures Usher in an Artif...Mission Critical Use Cases Show How Analytics Architectures Usher in an Artif...
Mission Critical Use Cases Show How Analytics Architectures Usher in an Artif...
 
SDD2017 - 03 Abed Ajraou - putting data science in your business a first uti...
SDD2017 - 03 Abed Ajraou  - putting data science in your business a first uti...SDD2017 - 03 Abed Ajraou  - putting data science in your business a first uti...
SDD2017 - 03 Abed Ajraou - putting data science in your business a first uti...
 
Questions On Technical Design Decisions
Questions On Technical Design DecisionsQuestions On Technical Design Decisions
Questions On Technical Design Decisions
 
Data Maturity for Nonprofits: Three Perspectives, Nine Lessons, and Three Ass...
Data Maturity for Nonprofits: Three Perspectives, Nine Lessons, and Three Ass...Data Maturity for Nonprofits: Three Perspectives, Nine Lessons, and Three Ass...
Data Maturity for Nonprofits: Three Perspectives, Nine Lessons, and Three Ass...
 
BDW17 London - Abed Ajraou - First Utility - Putting Data Science in your Bus...
BDW17 London - Abed Ajraou - First Utility - Putting Data Science in your Bus...BDW17 London - Abed Ajraou - First Utility - Putting Data Science in your Bus...
BDW17 London - Abed Ajraou - First Utility - Putting Data Science in your Bus...
 
Industrializing Data Science: Transform into an End-to-End, Analytics-Oriente...
Industrializing Data Science: Transform into an End-to-End, Analytics-Oriente...Industrializing Data Science: Transform into an End-to-End, Analytics-Oriente...
Industrializing Data Science: Transform into an End-to-End, Analytics-Oriente...
 
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
 
ML Times: Mainframe Machine Learning Initiative- June newsletter (2018)
ML Times: Mainframe Machine Learning Initiative- June newsletter (2018)ML Times: Mainframe Machine Learning Initiative- June newsletter (2018)
ML Times: Mainframe Machine Learning Initiative- June newsletter (2018)
 
Better the devil you know
Better the devil you knowBetter the devil you know
Better the devil you know
 
Intel Faster Risk Oct08 - Andrew Parry
Intel Faster Risk Oct08 - Andrew ParryIntel Faster Risk Oct08 - Andrew Parry
Intel Faster Risk Oct08 - Andrew Parry
 
A strategy for security data analytics - SIRACon 2016
A strategy for security data analytics - SIRACon 2016A strategy for security data analytics - SIRACon 2016
A strategy for security data analytics - SIRACon 2016
 
Machine Learning: Addressing the Disillusionment to Bring Actual Business Ben...
Machine Learning: Addressing the Disillusionment to Bring Actual Business Ben...Machine Learning: Addressing the Disillusionment to Bring Actual Business Ben...
Machine Learning: Addressing the Disillusionment to Bring Actual Business Ben...
 
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...
 
Analytics Trends 20145 - Deloitte - us-da-analytics-analytics-trends-2015
Analytics Trends 20145 -  Deloitte - us-da-analytics-analytics-trends-2015Analytics Trends 20145 -  Deloitte - us-da-analytics-analytics-trends-2015
Analytics Trends 20145 - Deloitte - us-da-analytics-analytics-trends-2015
 
Demystifying ML/AI
Demystifying ML/AIDemystifying ML/AI
Demystifying ML/AI
 
AI in healthcare
AI in healthcareAI in healthcare
AI in healthcare
 
5 Things You Should Know About Data Products
5 Things You Should Know About Data Products5 Things You Should Know About Data Products
5 Things You Should Know About Data Products
 
Focus on Data, Risk Control, and Predictive Analysis Drives New Era of Cloud-...
Focus on Data, Risk Control, and Predictive Analysis Drives New Era of Cloud-...Focus on Data, Risk Control, and Predictive Analysis Drives New Era of Cloud-...
Focus on Data, Risk Control, and Predictive Analysis Drives New Era of Cloud-...
 

More from mark madsen

A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou RangeA Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
mark madsen
 
A Pragmatic Approach to Analyzing Customers
A Pragmatic Approach to Analyzing CustomersA Pragmatic Approach to Analyzing Customers
A Pragmatic Approach to Analyzing Customers
mark madsen
 
Briefing room: An alternative for streaming data collection
Briefing room: An alternative for streaming data collectionBriefing room: An alternative for streaming data collection
Briefing room: An alternative for streaming data collection
mark madsen
 
Building the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architectureBuilding the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architecture
mark madsen
 
Briefing Room analyst comments - streaming analytics
Briefing Room analyst comments - streaming analyticsBriefing Room analyst comments - streaming analytics
Briefing Room analyst comments - streaming analytics
mark madsen
 
On the edge: analytics for the modern enterprise (analyst comments)
On the edge: analytics for the modern enterprise (analyst comments)On the edge: analytics for the modern enterprise (analyst comments)
On the edge: analytics for the modern enterprise (analyst comments)
mark madsen
 
Crossing the chasm with a high performance dynamically scalable open source p...
Crossing the chasm with a high performance dynamically scalable open source p...Crossing the chasm with a high performance dynamically scalable open source p...
Crossing the chasm with a high performance dynamically scalable open source p...
mark madsen
 
Don't let data get in the way of a good story
Don't let data get in the way of a good storyDon't let data get in the way of a good story
Don't let data get in the way of a good story
mark madsen
 
Big Data and Bad Analogies
Big Data and Bad AnalogiesBig Data and Bad Analogies
Big Data and Bad Analogies
mark madsen
 
Don't follow the followers
Don't follow the followersDon't follow the followers
Don't follow the followers
mark madsen
 
Exploring cloud for data warehousing
Exploring cloud for data warehousingExploring cloud for data warehousing
Exploring cloud for data warehousing
mark madsen
 
Open Data: Free Data Isn't the Same as Freeing Data
Open Data: Free Data Isn't the Same as Freeing DataOpen Data: Free Data Isn't the Same as Freeing Data
Open Data: Free Data Isn't the Same as Freeing Data
mark madsen
 
Exploring cloud for data warehousing
Exploring cloud for data warehousingExploring cloud for data warehousing
Exploring cloud for data warehousing
mark madsen
 
Big Data Wonderland: Two Views on the Big Data Revolution
Big Data Wonderland: Two Views on the Big Data RevolutionBig Data Wonderland: Two Views on the Big Data Revolution
Big Data Wonderland: Two Views on the Big Data Revolution
mark madsen
 
Using Data Virtualization to Integrate With Big Data
Using Data Virtualization to Integrate With Big DataUsing Data Virtualization to Integrate With Big Data
Using Data Virtualization to Integrate With Big Data
mark madsen
 
One Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database RevolutionOne Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database Revolution
mark madsen
 

More from mark madsen (16)

A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou RangeA Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
A Brief Tour through the Geology & Endemic Botany of the Klamath-Siskiyou Range
 
A Pragmatic Approach to Analyzing Customers
A Pragmatic Approach to Analyzing CustomersA Pragmatic Approach to Analyzing Customers
A Pragmatic Approach to Analyzing Customers
 
Briefing room: An alternative for streaming data collection
Briefing room: An alternative for streaming data collectionBriefing room: An alternative for streaming data collection
Briefing room: An alternative for streaming data collection
 
Building the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architectureBuilding the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architecture
 
Briefing Room analyst comments - streaming analytics
Briefing Room analyst comments - streaming analyticsBriefing Room analyst comments - streaming analytics
Briefing Room analyst comments - streaming analytics
 
On the edge: analytics for the modern enterprise (analyst comments)
On the edge: analytics for the modern enterprise (analyst comments)On the edge: analytics for the modern enterprise (analyst comments)
On the edge: analytics for the modern enterprise (analyst comments)
 
Crossing the chasm with a high performance dynamically scalable open source p...
Crossing the chasm with a high performance dynamically scalable open source p...Crossing the chasm with a high performance dynamically scalable open source p...
Crossing the chasm with a high performance dynamically scalable open source p...
 
Don't let data get in the way of a good story
Don't let data get in the way of a good storyDon't let data get in the way of a good story
Don't let data get in the way of a good story
 
Big Data and Bad Analogies
Big Data and Bad AnalogiesBig Data and Bad Analogies
Big Data and Bad Analogies
 
Don't follow the followers
Don't follow the followersDon't follow the followers
Don't follow the followers
 
Exploring cloud for data warehousing
Exploring cloud for data warehousingExploring cloud for data warehousing
Exploring cloud for data warehousing
 
Open Data: Free Data Isn't the Same as Freeing Data
Open Data: Free Data Isn't the Same as Freeing DataOpen Data: Free Data Isn't the Same as Freeing Data
Open Data: Free Data Isn't the Same as Freeing Data
 
Exploring cloud for data warehousing
Exploring cloud for data warehousingExploring cloud for data warehousing
Exploring cloud for data warehousing
 
Big Data Wonderland: Two Views on the Big Data Revolution
Big Data Wonderland: Two Views on the Big Data RevolutionBig Data Wonderland: Two Views on the Big Data Revolution
Big Data Wonderland: Two Views on the Big Data Revolution
 
Using Data Virtualization to Integrate With Big Data
Using Data Virtualization to Integrate With Big DataUsing Data Virtualization to Integrate With Big Data
Using Data Virtualization to Integrate With Big Data
 
One Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database RevolutionOne Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database Revolution
 

Recently uploaded

Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
James Polillo
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
theahmadsaood
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
AlejandraGmez176757
 

Recently uploaded (20)

Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 

The Black Box: Interpretability, Reproducibility, and Data Management

  • 1. Copyright Third Nature, Inc. Executive Briefing: The black box – interpretability, reproducibility, and data management O’Reilly Artificial Intelligence conference London October, 2019 Mark Madsen
  • 2. Copyright Third Nature, Inc.Copyright Third Nature, Inc. What does management care about? The perspective and problems of the person responsible for oversight of the results is spread across the organization and across multiple projects ▪ The owner of a decision or the outcome of the decision. Aka the process owner ▪ The CAO, CDO, VP of analytics, aka “your boss” if you’re a data scientist.
  • 3. Copyright Third Nature, Inc. Repeatability
  • 4. Copyright Third Nature, Inc. Operational predictability
  • 5. Reproducibility Can you support the answer at a later date? Can you replicate the process used to get the answer? Can you get similar results on the same or similar input?
  • 6. Copyright Third Nature, Inc. Scientific reproducibility Can you get the same results given the same starting conditions? ▪ Duplicate as much as possible the experiment. ▪ Results of experimental replication may not be the same for many reasons ▪ Detailed statistical analysis is needed ▪ And critical assessment of experiments
  • 7. Copyright Third Nature, Inc. Reproducibility in data science Can you get the same results on the same data? ▪ Direct replication is expected ▪ Assumption is that same input = same output ▪ You can have this with an unexplainable box ▪ But there are also confounding factors
  • 8. Copyright Third Nature, Inc. Moving from predictable rule-based systems to complex mathematical systems, and from there to systems that exhibit stochasticity, makes the task harder. One thing worse than a black box is a random black box.
  • 9. Copyright Third Nature, Inc. If you trust the box then it’s ok not to understand it. Without reproducibility there cannot be trust. Trust comes from it working It works = reliability, working over time Reliability of your solution is a systems problem, not a just a model problem (in the GST sense of system) Therefore, if there is trust, or there is low risk, you may not need to worry about reproducibility
  • 10. Copyright Third Nature, Inc. Interpretability and reproducibility are driven by trust, which only matters when there is enough risk Regulation and compliance Material decisions Big penalties Complication: the cost of false positives and false negatives are usually different, so model characteristics matter here. Cost of error Frequency of decision Low High High Don’t care Oh crap
  • 11. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Beyond toy examples, this problem matters Edge case error problems will limit AI applicability until they are solved (don’t hold your breath). The uncertainty challenge means: ▪ Calculate error costs and apply them to your model before (and after) ▪ The more risk averse, the more predictability is desire, the less variance one can tolerate
  • 12. Copyright Third Nature, Inc. The real questions Can I support this result at a later date? ▪ Do I care? • Is the risk (cost) worth worrying about? • Is the cost of reproducibility less than the risk and cost? ▪ Do I only need to justify the decision? • interpretability and trust may be enough • Unless there are audits or legal discovery involved ▪ Do I need to reproduce the results? • May not need interpretability • Need a lot of other things
  • 13. Copyright Third Nature, Inc. The real need is trust. Our trust is based on all the elements that are involved, not just the model. The higher the stakes the more you must think about all the ways it could go wrong. Reliability and robustness of the technology environment is as important as the model. Reproducibility is not just the data scientist’s problem – it is an operational concern.
  • 14. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Three categories of use, with differing trust levels Embedded Human in the loop Ad hoc
  • 15. Copyright Third Nature, Inc. Explainability only addresses the functioning of the model
  • 16. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Three categories of use, with differing trust levels Embedded Human in the loop Ad hoc In general, explainability is more useful when there is human agency and control in decisions. Reproducibility is increasingly important where the machine has agency because the impact of automation expands the scope and increases the complexity beyond the model.
  • 17. Machine learning is the smallest part of the environment ML Code Analysis Tools Data Collection Machine Resource Management Serving Infrastructure Feature Extraction Configuration Data Verification Process Management Tools Monitoring https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems ©2018 Teradata
  • 18. Copyright Third Nature, Inc.Copyright Third Nature, Inc. The end result in the data science landscape is complexity You can explain the model, but can you explain this?
  • 19. Copyright Third Nature, Inc.Copyright Third Nature, Inc. ML Principle: CACE, Change Anything Change Everything In embedded or autonomous ML everything is connected. Events usually happen in real time. ML is very sensitive to context and input. “So what if I changed NumPy in dev?”
  • 20. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Working backwards from the decision or answer, what do you need to have reproducibiity? ProcessSource Run Env
  • 21. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Fine for a single-run model. What if it’s embedded in an application somewhere? ProcessSource Run Env App EnvProcess Deploy
  • 22. Copyright Third Nature, Inc.Copyright Third Nature, Inc. If a model is used over time, you will likely change it. Or sources, or processes will change. ProcessSource Run Env App EnvProcess Deploy Run Env Run Env Process Process Run Env Process Run Env Process
  • 23. Copyright Third Nature, Inc. We’re so focused on the light switch that we’re not talking about the light.
  • 24. Copyright Third Nature, Inc. Most emphasis in the industry is on code and code artifacts: ▪ Model repositories ▪ Model management ▪ Pipeline frameworks ▪ Packaging ▪ Versioning ▪ Tools Why? Because vendors want to sell you products for the problem they helped create. Managing the code without managing the data?
  • 25. Copyright Third Nature, Inc.Copyright Third Nature, Inc. TIDY DATA: Hadley Wickham makes the case for Tidy data sets, that have specific structure, are easy to work with, that free analysts from mundane data manipulation chores – there’s no need to start from scratch and reinvent new methods for data cleaning Source: Tidy Data by Hadley Wickham, Journal of Statistical Software, Vol 59, issue 10 (2014) https://vita.had.co.nz/papers/tidy-data.html What do the experts say?
  • 26. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Andrew Ng Leading AI Researcher “If your boss asks you, tell them that I said build a unified data warehouse” What about AI? GIGO! Everyone wants shortcuts. There’s aren’t any shortcuts
  • 27. Copyright Third Nature, Inc. Reproducibility? No problem. Just save versions of data
  • 28. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Today’s market solution: the Data Lake* *Depiction for illustrative purposes only, no warranty express or implied, actual system may vary. By a lot.
  • 29. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Today’s market solution: the Data Lake + DIY Data hoarding is not a data management strategy
  • 30. Copyright Third Nature, Inc. A standard data science approach (to avoid) One-Pipeline- Per-Process Redundant Effort / Cost / Complexity / etc. WELL HEAD WELL HEAD Example use cases
  • 31. Copyright Third Nature, Inc. A dystopian model: Lake + data engineers + pipelines The Lake with pipelines to files or tables for each model is exactly the same pattern as mainframe COBOL batch Data pipeline code We already know that people don’t scale. Don’t do this
  • 32. Copyright Third Nature, Inc. Self-service tradeoffs Pay now or pay later, but you will always pay. The question is which payment will be less. Self-service gives flexibility and agility, but can reduce repeatability and add duplicated effort and conflicts, and an increase in risk.
  • 33. Copyright Third Nature, Inc.Copyright Third Nature, Inc. But that self-serve bit only applies to building models. Embedding them in the enterprise is more involved ProcessSource Run Env App EnvProcess Deploy Run Env Run Env Process Process Run Env Process Run Env Process Managing the data on an ongoing basis is a complex problem to solve
  • 34. Copyright Third Nature, Inc. The organization-wide focus needs to be on repeatability – where it can be supported
  • 35. Copyright Third Nature, Inc. Buildings above: flexibility, repurposing, faster change above, funded independently Applications Utilities below: stability, reuse, slow predictable change below, funded centrally Infrastructure The infrastructure is a hidden combination of technology, process, and methods. There is a need to draw boundaries for responsibilities in the organization. It can’t be all business or all IT. Separate the uses from the infrastructure – focus on the city, not the buildings
  • 36. Copyright Third Nature, Inc.Copyright Third Nature, Inc. 2855 Out of Deviation Sensor Readings Fraud Events CUSTOMER EXPERIENCE SALES MARKETING OPERATIONS Abandoned Carts Online Price Quotes Social Media Influencers FINANCE 11234 Sensor data formatting Unit Normalization Vehicle/version normalization Make/model/year selection Data pipelines, data architecture, and curation
  • 37. Copyright Third Nature, Inc. Data curation is not data modeling The problem with so many sources, types, formats and latencies of data is that it is now impossible to create in advance one model for all of it. Data modeling is about the inside of a dataset. Curation is about the entire dataset. It’s about: creating, labeling, organizing, finding, navigating, retiring. Data curation, rather than data modeling, is becoming the most important data management practice.
  • 38. Copyright Third Nature, Inc. Data can be maintained at multiple levels: not raw or DW Ingredients Goal: available User needs a recipe in order to make use of the data. Pre-mixed Goal: discoverable and integrateable User needs a menu to choose from the data available Meals Goal: usable User needs utensils but is given a finished meal
  • 39. Copyright Third Nature, Inc. Data architecture dictates technology and process Collection Capture metadata (including request) Record the structure Apply keys PII masking, restrict Start lineage Distribution Common structures RDM and MDM Subject models Make data findable and linkable Data provisioning Consumption Model for target use Quality rules Track provenance Apply SLAs As few engines as possible – not fewer Decide on policies for when to place data in which area
  • 40. Copyright Third Nature, Inc.Copyright Third Nature, Inc. ML has a lot of data requirements: no shortcuts https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
  • 41. Copyright Third Nature, Inc. Manage your data (or it will manage you) Data is the foundation of the models, and the results. Focus on managing data as well as the code and tool artifacts. Reproducibility, and data science in general, requires that you treat the operating and the build environments as a whole. Data science in the long term is not about individual projects, but organizational capability. This means infrastructure, process, governance.
  • 42. Copyright Third Nature, Inc. Things you can do 1. Determine what level of risk is acceptable based on the costs of false positives and false negatives, or poor accuracy. 2. Manage the data history such that you can always reconstruct it. 3. Think through the dependencies outside the model that affect reproducibility. 4. Track the configuration of everything in the chain of dependencies. 5. Do not let a thousand flowers bloom with BYOT. You will have to cut many of them due to complexity. 6. Do not let IT overcomplicate your environment with technology because of belief things must be big. Brittle infrastructure can’t be tolerated with embedded machine learning. 7. Make a rational decision about how much effort to expend.
  • 43. Mark Madsen is a Fellow at Teradata in the Technology and Innovation Office. he focused on the data and analytics ecosystem, problems of large-scale design, and R&D. Prior to that he was president of Third Nature, a research and consulting firm focused on analytics, integration and data management. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. He received awards for his work from the American Productivity & Quality Center and the Smithsonian Institute. He is an international speaker, chairs several conferences and program committees. Mark Madsen