June Andrews - The Uncanny Valley of ML

The
Uncanny Valley
of ML
Dr. June Andrews, Delphi Data, Nov 2019

Human Decision Systems
Simple Paradigm Represents:
• Judges setting bail
• Doctors processing images
• DMV clerks renewing licenses
• Muni train drivers stopping & going
• Administrators admitting students
• … you coming to MLConf
Information In, Decision Out, Works Pretty Well

Hype: Technology Will Replace People Overnight

???
Far More Likely Progression of Technology
Machine Learning Will Augment Decision Making with Recommendations

Template Design for ML + Decision Systems
[Figure: example model output, ‘Recommendation: Yes’ with the reason ‘Feature A is high’]
Keep the Chain of Responsibility Intact
Integrate as soon as ML Accuracy ≈ Human Accuracy

Pressures for Introducing ML into Decision Systems
• Decreasing cost of ML
• Increasing data from revolutions in sensors, records & infrastructure
• Fewer experts graduating in ‘older fields’
• Increasing number of decisions created by more people

Ideal Result of ML + Human Decisions
As ML Accuracy Approaches Human Accuracy, System Performance Improves
[Figure: system Accuracy and Time & Cost (Low to High) vs. ML Accuracy (Terrible, Human, Perfect)]

The Uncanny Valley of ML
As ML Accuracy Approaches Human Accuracy, System Performance Degrades
[Figure: system Accuracy and Time & Cost (Low to High) vs. ML Accuracy]

Finding
the
Uncanny
Valley
of ML

Finding An Uncanny Valley of ML
Use Test Environments To Avoid The Uncanny Valley in Production
1. Create a simple labeling task with ground truth labels
2. Measure human accuracy & speed
3. Add a recommended decision from a ‘Model’
4. Simulate models of different accuracy near human accuracy by perturbing the ground truth labels
5. Assign each person to a simulated model and run test labels for normalization
6. Measure system accuracy & speed as a function of ML accuracy
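The last step, measuring system accuracy and speed as a function of ML accuracy, could be sketched roughly as below. This is a minimal illustration, not the analysis used in the talk: the trial tuple layout and the function names `summarize_by_ml_accuracy` and `valley_levels` are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_ml_accuracy(trials):
    """Summarize human + ML performance per simulated ML accuracy level.

    Each trial is a hypothetical tuple:
    (simulated_ml_accuracy, person_answer, true_label, seconds_to_decide).
    """
    grouped = defaultdict(list)
    for ml_accuracy, answer, truth, seconds in trials:
        grouped[ml_accuracy].append((answer == truth, seconds))

    summary = {}
    for ml_accuracy, rows in sorted(grouped.items()):
        summary[ml_accuracy] = {
            # Fraction of final (human) decisions that matched ground truth.
            "system_accuracy": mean(1.0 if correct else 0.0 for correct, _ in rows),
            # Average time a person needed to decide.
            "mean_seconds": mean(seconds for _, seconds in rows),
        }
    return summary


def valley_levels(summary, human_only_accuracy):
    """Return the ML accuracy levels where the combined system is
    less accurate than people deciding without any recommendation."""
    return [
        level
        for level, stats in summary.items()
        if stats["system_accuracy"] < human_only_accuracy
    ]
```

Comparing every level against the human-only baseline from step 2 is one simple way to locate a valley; a real study would also want confidence intervals before drawing conclusions.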

Easy Street - How many coffee mugs do you see?
Throwback to the first demos of Neural Nets for Computer Vision @ Cornell
• 100+ photos hand labeled with the number of coffee mugs for ground truth
• Label quality is then perturbed to simulate different ML accuracies, with a bias to perturbing the images with many mugs
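One way the perturbation described above might look in code is sketched here. The slides do not give the actual procedure, so the weighting (mug count plus one), the off-by-one error, and the function name `simulate_model_labels` are all assumptions for illustration.

```python
import random


def simulate_model_labels(true_counts, target_accuracy, seed=0):
    """Return fake 'model' labels that match ground truth on roughly
    target_accuracy of the photos, corrupting many-mug photos more often.

    Assumptions (not from the slides): corruption weight is mug count + 1,
    and a corrupted label is off by exactly one mug.
    """
    rng = random.Random(seed)
    n = len(true_counts)
    n_wrong = round(n * (1.0 - target_accuracy))

    # Weighted sampling without replacement of the photos to corrupt.
    remaining = list(range(n))
    wrong = set()
    for _ in range(n_wrong):
        weights = [true_counts[i] + 1 for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        wrong.add(pick)
        remaining.remove(pick)

    labels = []
    for i, count in enumerate(true_counts):
        if i in wrong:
            # Off by one, but never a negative number of mugs.
            delta = rng.choice([-1, 1]) if count > 0 else 1
            labels.append(count + delta)
        else:
            labels.append(count)
    return labels
```

Calling this with several `target_accuracy` values around the measured human accuracy gives one simulated model per value, and each participant can then be assigned to one of them.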

…used Amazon Mechanical
Turk workers

Ideal Results
[Figure: example photos labeled ‘2 mugs’ and ‘1 mug’; Accuracy and Time & Cost (Low to High) vs. ML Accuracy]

Actual System Behavior
[Figure: System Accuracy (Low to High) vs. ML Accuracy (Terrible, Human (94%), Perfect); example photos labeled ‘2 mugs’ and ‘1 mug’]

Actual System Behavior
[Figure: Decision Time (Low to High) vs. ML Accuracy (Terrible, Human (94%), Perfect); example photos labeled ‘2 mugs’ and ‘1 mug’]

Uncanny Valley for ML in Counting Coffee Mugs?
• Do people want machines to be wrong?
• Do we trust machines more than we trust ourselves when they are near, but not over, human accuracy?
• Are we lazy and want to defer decision making? (people varied from the ML when it was correct)

The Uncanny
Valley of ML
in the
Judicial
System

Pressures for Introducing ML into Court Systems
• Decreasing cost of ML
• Increasing data from records & social media
• Fewer experts graduating in ‘older fields’
• Increasing number of decisions created by more people

ML has a Growing Presence in Courts
Countries are Comparing Notes & Learning How to Use AI in the Courts
• Estonia is building a ‘Robot Judge’ to settle disputes under $8,000 - DailyMail. Part of a broader e-government initiative. France wants to match Estonia's level by 2022
• 160,000 parking tickets overturned in the UK & US with a chatbot - Guardian on DoNotPay
• Risk Score Print Outs in Cleveland. Includes features like ‘how often are you bored’ - Quartz
• Note - arraignment hearings are often under 5 minutes.

Locally ML is used in the Judicial System for Bail
• In California 49 of 58 counties use a Pretrial Assessment System (yes SF is one) [courts.ca.gov]
• SB 10, signed in 2018, would make it mandatory in October of 2019, but a 2020 referendum contradicting SB 10 has created a temporary pause

Just a sec, does Bail Matter?
• 20% of jail inmates in the US are awaiting trial
• Misdemeanors can take several months for trial, felonies can take years. Average wait time in the Bronx is 642 days for a non-jury trial and 827 days for a jury trial.
• Pretrial detention leads to a 13% increase in plea agreements, a 42% increase in length of sentence and a 41% increase in court fees - Stevenson, The Journal of Law, Economics, and Organization
8th Amendment (Bill of Rights): ‘Excessive bail shall not be required, nor excessive fines imposed, nor cruel and unusual punishments inflicted.’

Unknown System Accuracy, Show Manipulation of a Single Label
1. Take a Real Case
2. Simulate different UIs & different model deliverables
3. Compare label distribution with actual outcome

Unknown System Accuracy, Instead Show Manipulation of a Single Label
Details Taken from Machine Bias by ProPublica - 2016
Summary - a high schooler stole a bike for a few blocks and received a High Risk COMPAS Score (COMPAS is sold by Equivant). Bail was set at $1000

400+ Survey Participants From Amazon MTurk
[Figure: Proportion of Responses (0% to 100%) choosing $0 Bail, $1000 Bail or No Release, for each condition: No ML; ML - Low Risk; ML - Medium Risk; ML - High Risk; ML - High Risk Positive Support; ML - High Risk Negative Support]
• Power of Suggestion of High Risk ML (no reason) results in +14% in bail denied
• Power of Suggestion of Low Risk ML (no reason) results in +14% in $0 bail
• High Risk ML with Negative Features results in +40% in denying bail
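The comparison behind statements like ‘+14% in bail denied’ can be sketched as a difference in response proportions between a condition and the ‘No ML’ group. The slides do not say whether the reported shifts are percentage points or relative changes; the sketch below assumes percentage points, and the function names are hypothetical.

```python
from collections import Counter

# The three answers offered to survey participants, as shown on the chart.
OUTCOMES = ("$0 Bail", "$1000 Bail", "No Release")


def response_proportions(responses):
    """Return the share of responses for each outcome."""
    if not responses:
        raise ValueError("need at least one response")
    counts = Counter(responses)
    return {outcome: counts[outcome] / len(responses) for outcome in OUTCOMES}


def shift_vs_baseline(baseline, condition, outcome):
    """Percentage-point change in one outcome relative to the 'No ML' group.

    Assumption: shifts are measured in percentage points.
    """
    base = response_proportions(baseline)[outcome]
    cond = response_proportions(condition)[outcome]
    return 100.0 * (cond - base)
```

For example, `shift_vs_baseline(no_ml_responses, high_risk_responses, "No Release")` would give the change in bail denied for the ‘ML - High Risk’ condition, where both arguments are lists of raw survey answers.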

104 Survey Participants From June’s Network
A Higher Level of Machine Learning Knowledge Does NOT Change the Trend
[Figure: Proportion of Responses (0% to 100%) choosing $0 Bail, $1000 Bail or No Release, for each condition: No ML; ML - Positive Support; ML - Negative Support]
• While an overall more forgiving group, there is still a +40% increase in denying bail

Aside: Fundamental Design Flaw in SB10 & COMPAS Scores
• Need to allow for the ML system to return ‘Uncertain, not enough data’
• The Bureau of Justice Statistics has a warning of ‘Interpret data with caution. Estimate based on 10 or fewer sample cases’ for someone with Brisha’s details (Brisha is the high schooler in the ProPublica case)
• … also, effectiveness of requiring ML in the California courts is not slated to be measured until 2023, 4 years after release
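A recommendation that is allowed to abstain could be sketched as follows. This is only an illustration of the ‘Uncertain, not enough data’ idea: the threshold reuses the ‘10 or fewer sample cases’ wording from the Bureau of Justice Statistics warning, and the `Recommendation` type and `recommend` function are hypothetical names, not part of COMPAS or any SB 10 tooling.

```python
from dataclasses import dataclass

# Assumption: abstain whenever the estimate rests on 10 or fewer comparable
# cases, mirroring the Bureau of Justice Statistics warning quoted above.
MIN_SUPPORTING_CASES = 11

UNCERTAIN = "Uncertain, not enough data"


@dataclass
class Recommendation:
    label: str             # e.g. "High Risk", or the UNCERTAIN marker
    supporting_cases: int  # number of comparable past cases behind the label


def recommend(risk_label, supporting_cases, min_cases=MIN_SUPPORTING_CASES):
    """Return the model's label only when enough comparable cases exist."""
    if supporting_cases < min_cases:
        return Recommendation(UNCERTAIN, supporting_cases)
    return Recommendation(risk_label, supporting_cases)
```

With this shape, `recommend("High Risk", 8)` yields the uncertain marker instead of a risk label, leaving the decision maker without a suggestion to anchor on.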

The
Uncanny
Valley of
ML
- Why does it exist?

Uncanny Valley of AI
Discovered by Masahiro Mori (1970)
Box office success of movies is
potentially related to the Uncanny
Valley:
• Final Fantasy
• Polar Express
• Beowulf
• The Incredible Hulk

Uncanny Valley of AI
Why it exists is an open field of research:
• Mismatch between expectations and observations [Tinwell]
• Difficult to classify objects that move between the boundaries of categories [Looser & Wheatley]
• Recognizing a similar cognitive [Gray & Wegner]
• Ambiguity about the presence of threat [McAndrews]
When it exists is also a debate.
2013 Activision Animation

Uncanny Valley of ML
Additional Theories to Consider:
• People blame biased decisions on the machine
• People want machines to be wrong
• Disagreeing is a different skill set than analyzing
• Providing explanations of reasoning suppresses intuitive decisions
When it exists should also be studied further.

Thought Experiment
Lean on the field of Team Research to Bootstrap Expectations on Integration
• Imagine if the ML system were another person, who wasn’t quite as bright as the first person
• Decisions would take longer, and the bright person would question themselves more
• Small Group Communication: A Theoretical Approach has additional details of when groups underperform individuals
[Figure: decision diagram with outcome Yes / No]

Crossing
the
Chasm
- Avoiding the Uncanny Valley

Self-Driving Cars - Headed for Uncanny Valley
People are viewed as backups who would stay behind the wheel and intervene to avoid accidents in unpredictable or computer-confusing instances. The self-driving option should be included as soon as possible for competitive advantage.
[Figure: decision diagram with options Left, No-op, Right and Left, Right]
• Decreasing cost of ML
• Increasing data from revolutions in sensors & records
• Aging population
• Increasing number of drivers, increasing commutes

Best Practice: Bet on the Power of ML
Delaying Release Until Performance Crosses the Valley
[Figure: decision diagram with options Left, No-op, Right and Left, Right]
Volvo changed from targeting Level 2 to Levels {4, 5} after including executives in a simulation of driving a Level 2 car [Wired]

Best Practice: Build Simulators
Identify location and impact of the Valley before building
Nuclear Power Plants, Aviation, Moon Landings … all use simulators to refine product designs before launch
**include actual judges/experts in simulations

Best Practice: Avoid by Redefining Success
• Repurpose - ML designed for 1 system may work well for another
• Reset expectations
• Relabel Bad Labels

The Uncanny Valley Doesn’t Always Exist
With the first UI, people completely ignored the ML suggestion
You could design a system no one uses to avoid the Valley
[Figure: System Accuracy vs. ML Accuracy; example photos labeled ‘2 mugs’ and ‘1 mug’]

Call to Action:
Source a New Field of Research ‘ML Integration’
• HCI, Team Research, Data Science, AI, Psychology, User Research & Application Fields are all trying to understand integrating ML into Human Decision Systems … independently and slowly
• Binding efforts into a single discipline will rapidly increase development and possibly meet demand
* I am not qualified to give this talk … but who is?

Bootstrapping ML Integration
Funding Sources {Military, Accenture, Academia, …?}
Initial Areas of Research:
• How to calculate the speed and accuracy of large distributed human + ML decision systems
• How to safely train and roll out new decision processes to experts
• How to fairly explain an ML decision. Beyond explainable AI.
• Design and run experiments in these systems
[Figure: diagram labeled Machine Learning, DS, AI, Politics, Team Research, Effective ML Integration]

The Hard Part - Does The Uncanny Valley matter?
[Figure: Accuracy and Time & Cost (Low to High) vs. ML Accuracy]

Delivering Results in the Uncanny Valley Can Lead to Early Project Termination
• These are critical systems where a 5% drop in accuracy undoes years of research and investments. Mistakes are not treated kindly in these fields.
• Funding is much more tightly controlled and hard to obtain after failed launches.
• Legal modifications may become barriers

Call to Action: Implement Guardrail Metrics for Vulnerable Members of the Population
• Guardrail Metrics are used to allow ML to optimize as much as it can within a specified business boundary
• Let’s define a boundary in tech to not make systems worse for black girls
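A guardrail check of this kind might look like the sketch below: a candidate system is adopted only if it improves the overall metric and does not make outcomes worse for any protected group. The data layout, group names, and function names are assumptions for illustration, not an established API.

```python
def passes_guardrail(current_by_group, candidate_by_group, protected_groups,
                     tolerance=0.0):
    """True only if the candidate is no worse than the current system
    for every protected group (higher metric values are better)."""
    for group in protected_groups:
        if candidate_by_group[group] < current_by_group[group] - tolerance:
            return False
    return True


def choose_system(current, candidate, protected_groups):
    """Pick which system to ship.

    Both arguments are hypothetical dicts of the form
    {"overall": float, "by_group": {group_name: float}}.
    """
    improves_overall = candidate["overall"] > current["overall"]
    within_boundary = passes_guardrail(
        current["by_group"], candidate["by_group"], protected_groups
    )
    return "candidate" if improves_overall and within_boundary else "current"
```

The design choice here is that the guardrail is a hard constraint rather than a term in the objective: overall gains cannot be traded against a loss for the protected group.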

Thank You.
Slides at /drandrews
