Every so often, the conundrum of the Uncanny Valley re-emerges as advanced technologies evolve from clearly experimental products into refined, accepted technologies. We have seen its effects in robotics, computer graphics, and page load times. The debate over how to handle the new technology detracts from its benefits. When machine learning is added to human decision systems, a similar effect can be measured as increased response time and decreased accuracy. These systems include radiology, judicial assignments, bus schedules, housing prices, power grids, and a growing variety of other applications. Unfortunately, the Uncanny Valley of ML can be hard to detect in these systems, and introducing ML at great expense can end up degrading system performance. Here, we’ll introduce key design principles for bringing ML into human decision systems so that they navigate around the Uncanny Valley and avoid its pitfalls.
2. Human Decision Systems
Simple Paradigm Represents:
• Judges setting bail
• Doctors processing images
• DMV clerks renewing licenses
• Muni train drivers stopping & going
• Administrators admitting students
• … you coming to MLConf
Information In, Decision Out, Works Pretty Well
4. ???
Far More Likely Progression of Technology
Machine Learning will Augment Decision Making with Recommendations
5. Template Design for ML + Decision Systems
Recommendation: Yes
Feature A is high
Keep the Chain of Responsibility Intact
Integrate as soon as ML Accuracy ≈ Human Accuracy
6. Pressures for Introducing ML into Decision Systems
• Decreasing cost of ML
• Increasing data from revolutions in sensors, records & infrastructure
• Fewer experts graduating in ‘older fields’
• Increasing number of decisions created by more people
7. Ideal Result of ML + Human Decisions
As ML Accuracy Approaches Human Accuracy, System Performance Improves
[Figure: accuracy and time & cost (low to high) plotted against ML accuracy (terrible, human, perfect)]
8. The Uncanny Valley of ML
As ML Accuracy Approaches Human Accuracy, System Performance Degrades
[Figure: accuracy and time & cost (low to high) plotted against ML accuracy (terrible, human, perfect)]
10. Finding An Uncanny Valley of ML
Use Test Environments To Avoid The Uncanny Valley in Production
1. Create a simple labeling task with ground truth labels
2. Measure human accuracy & speed
3. Add a recommended decision from a ‘Model’
4. Simulate models of different accuracy near human accuracy by perturbing the ground truth labels
5. Assign each person to a simulated model and run test labels for normalization
6. Measure system accuracy & speed as a function of ML accuracy
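The perturbation and measurement steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not the code used for the talk: labels are taken to be non-negative integer counts, a wrong label is produced by shifting the true count by one, and the names `simulate_model_labels` and `summarize_by_model_accuracy` as well as the response tuple layout are hypothetical.

```python
import random
from collections import defaultdict


def simulate_model_labels(truth, accuracy, rng):
    """Copy `truth`, making round(len(truth) * (1 - accuracy)) labels wrong.

    Assumption: labels are non-negative integer counts, and a wrong label
    is the true count shifted by one.
    """
    n_wrong = round(len(truth) * (1 - accuracy))
    wrong = set(rng.sample(range(len(truth)), n_wrong))
    labels = []
    for i, y in enumerate(truth):
        if i in wrong:
            # Shift by +1 or -1, but never below zero.
            y = y + (rng.choice([-1, 1]) if y > 0 else 1)
        labels.append(y)
    return labels


def summarize_by_model_accuracy(responses):
    """Group responses by the accuracy of the simulated model shown.

    `responses` holds (model_accuracy, human_answer, truth, seconds) tuples,
    a hypothetical layout. Returns a dict mapping each model accuracy to
    (system_accuracy, mean_seconds).
    """
    buckets = defaultdict(list)
    for model_accuracy, answer, truth, seconds in responses:
        buckets[model_accuracy].append((answer == truth, seconds))
    summary = {}
    for model_accuracy, rows in sorted(buckets.items()):
        n = len(rows)
        system_accuracy = sum(1 for correct, _ in rows if correct) / n
        mean_seconds = sum(seconds for _, seconds in rows) / n
        summary[model_accuracy] = (system_accuracy, mean_seconds)
    return summary
```

Plotting `system_accuracy` and `mean_seconds` against the simulated model accuracy would then show whether performance dips as the model nears human accuracy.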
11. Easy Street - How many coffee mugs do you see?
Throwback to the first demos of Neural Nets for Computer Vision @ Cornell
100+ photos hand labeled with the number of coffee mugs for ground truth
Label quality is then perturbed to simulate different ML accuracies, with a bias toward perturbing the images with many mugs
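One way to bias the perturbation toward images with many mugs is weighted sampling without replacement. The sketch below is an assumption about how such a bias could be implemented, not the procedure used in the experiment: the weight `count + 1` and the names `pick_biased_indices` and `perturb_with_bias` are hypothetical.

```python
import random


def pick_biased_indices(counts, k, rng):
    """Pick k distinct image indices, favoring images with more mugs.

    Assumption: the sampling weight of an image is its mug count plus one,
    so images with zero mugs can still be picked.
    """
    remaining = list(range(len(counts)))
    chosen = []
    for _ in range(k):
        weights = [counts[i] + 1 for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        remaining.remove(pick)
        chosen.append(pick)
    return chosen


def perturb_with_bias(counts, accuracy, rng):
    """Return labels that differ from `counts` on round(n * (1 - accuracy)) images."""
    n_wrong = round(len(counts) * (1 - accuracy))
    wrong = set(pick_biased_indices(counts, n_wrong, rng))
    labels = []
    for i, y in enumerate(counts):
        if i in wrong:
            # Shift the count by one, never below zero.
            y = y + (rng.choice([-1, 1]) if y > 0 else 1)
        labels.append(y)
    return labels
```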
12. Easy Street - How many coffee mugs do you see?
…used Amazon Mechanical Turk workers
13. Ideal Results
[Figure: example photos labeled ‘1 mug’ and ‘2 mugs’; accuracy and time & cost (low to high) plotted against ML accuracy (terrible, human, perfect)]
16. Uncanny Valley for ML in Counting Coffee Mugs?
• Do people want machines to be wrong?
• Do we trust machines more than we trust ourselves when they are near but not over human accuracy?
• We’re lazy and want to defer decision making (people varied from the ML when it was correct)
[Figure: example photos labeled ‘1 mug’ and ‘2 mugs’]
18. Pressures for Introducing ML into Court Systems
• Decreasing cost of ML
• Increasing data from records & social media
• Fewer experts graduating in ‘older fields’
• Increasing number of decisions created by more people
19. ML has a Growing Presence in Courts
Countries are Comparing Notes & Learning How to Use AI in the Courts
• Estonia is building a ‘Robot Judge’ to settle disputes under $8,000 -DailyMail
• Broader initiative of e-government. France wants to match Estonia's level by 2022
• 160,000 parking tickets overturned in the UK & US with a chatbot -Guardian on DoNotPay
• Risk Score Print Outs in Cleveland. Includes features like ‘how often are you bored’ -Quartz
• Note - arraignment hearings are often under 5 minutes.
20. Locally ML is used in the Judicial System for Bail
• In California 49 of 58 counties use a Pretrial Assessment System (yes SF is one) [courts.ca.gov]
• SB 10, signed in 2018, would make it mandatory in October of 2019, but a 2020 referendum contradicting SB 10 has created a temporary pause
21. Just a sec, does Bail Matter?
• 20% of jail inmates in the US are awaiting trial
• Misdemeanors can take several months for trial, felonies can take years. Average wait time in the Bronx is 642 days for a non-jury trial and 827 days for a jury trial.
• Pretrial detention leads to a 13% increase in plea agreements, a 42% increase in length of sentence and a 41% increase in court fees -Stevenson, The Journal of Law, Economics, and Organization
8th Amendment (Bill of Rights): ‘Excessive bail shall not be required, nor excessive fines imposed, nor cruel and unusual punishments inflicted.’
22. Finding An Uncanny Valley of ML
Unknown System Accuracy, Show Manipulation of a Single Label
1. Take a Real Case
2. Simulate different UIs & different model deliverables
3. Compare label distribution with actual outcome
23. Finding An Uncanny Valley of ML
Unknown System Accuracy, Instead Show Manipulation of a Single Label
Details Taken from Machine Bias by ProPublica -2016
Summary - a high schooler stole a bike for a few blocks and had a High Risk COMPAS Score by Equivant. Bail was set at $1000
24. 400+ Survey Participants From Amazon MTurk
[Figure: proportion of responses (0% to 100%) choosing $0 Bail, $1000 Bail or No Release, for each condition: No ML; ML - Low Risk; ML - Medium Risk; ML - High Risk; ML - High Risk Positive Support; ML - High Risk Negative Support]
• Power of Suggestion of High Risk ML (no reason) results in +14% in bail denied
• Power of Suggestion of Low Risk ML (no reason) results in +14% in $0 bail
• High Risk ML with Negative Features results in +40% in denying bail
25. 104 Survey Participants From June’s Network
A Higher Level of Machine Learning Knowledge Does NOT Change the Trend
[Figure: proportion of responses (0% to 100%) choosing $0 Bail, $1000 Bail or No Release, for each condition: No ML; ML - Positive Support; ML - Negative Support]
While this group was more forgiving overall, it still showed a +40% increase in denying bail
26. Fundamental Design Flaw in SB10 & COMPAS Scores
• Need to allow for the ML system to return ‘Uncertain, not enough data’
• The Bureau of Justice Statistics has a warning of ‘Interpret data with caution. Estimate based on 10 or fewer sample cases’ for someone with Brisha’s details
• … also, effectiveness of requiring ML in the California courts is not slated to be measured until 2023, 4 years after release
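A recommendation that is allowed to abstain could look like the sketch below. It is a hypothetical illustration, not how COMPAS or any pretrial assessment tool works: the `Recommendation` type, the `recommend` function, the `support` count and the 0.33 / 0.67 cut-offs are all assumptions made for the example, and the default `min_support=10` only echoes the quoted warning about estimates based on 10 or fewer sample cases.

```python
from dataclasses import dataclass
from typing import Optional

UNCERTAIN = "Uncertain, not enough data"


@dataclass
class Recommendation:
    label: str
    score: Optional[float]  # None when the system abstains
    support: int            # number of comparable past cases behind the score


def recommend(score, support, min_support=10):
    """Return a risk label, or abstain when too few comparable cases exist.

    Assumptions: `score` is a risk estimate in [0, 1] and the 0.33 / 0.67
    cut-offs are placeholders chosen only for illustration.
    """
    if support <= min_support:
        # Abstain instead of presenting a score built on thin data.
        return Recommendation(UNCERTAIN, None, support)
    if score < 0.33:
        label = "Low Risk"
    elif score < 0.67:
        label = "Medium Risk"
    else:
        label = "High Risk"
    return Recommendation(label, score, support)
```

The design choice is that abstention is a first-class output, so the person making the decision sees the lack of data rather than a confident-looking label.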
Aside:
28. Uncanny Valley of AI
Discovered by Masahiro Mori (1970)
Box office success of movies is potentially related to the Uncanny Valley:
• Final Fantasy
• Polar Express
• Beowulf
• The Incredible Hulk
29. Uncanny Valley of AI
Why it exists is an open field of research:
• Mismatch between expectations and observations [Tinwell]
• Difficult to classify objects that move between the boundaries of categories [Looser & Wheatley]
• Recognizing a similar cognitive [Gray & Wegner]
• Ambiguity about the presence of threat [McAndrews]
When it exists is also a debate.
2013 Activision Animation
30. Uncanny Valley of ML
Additional Theories to Consider:
• People blame biased decisions on the machine
• People want machines to be wrong
• Disagreeing is a different skill set than analyzing
• Providing explanations of reasoning suppresses intuitive decisions
When it exists should also be studied further.
31. Thought Experiment
Yes / No
• Imagine if the ML system was another person, who wasn’t quite as bright as the first person
• It would take longer, the bright person would question themselves more
• Small Group Communication: A Theoretical Approach has additional details of when groups underperform individuals
Lean on the field of Team Research to Bootstrap Expectations on Integration
33. Self-Driving Cars - Headed for Uncanny Valley
People are viewed as backups who would stay behind the wheel and intervene to avoid accidents in unpredictable or computer-confusing instances. The self-driving option should be included as soon as possible for competitive advantage.
[Figure: controls labeled Left, No-op, Right]
• Decreasing cost of ML
• Increasing data from revolutions in sensors & records
• Aging population
• Increasing number of drivers, with commutes increasing
34. Best Practice: Bet on the Power of ML
[Figure: controls labeled Left, No-op, Right]
Volvo changed from targeting Level 2 to Levels {4, 5} after including executives in a simulation of driving a Level 2 car [Wired]
Delaying Release Until Performance Crosses the Valley
35. Best Practice: Build Simulators
Nuclear Power Plants, Aviation, Moon Landings … all use simulators to refine product designs before launch
**include actual judges/experts in simulations
Identify location and impact of the Valley before building
36. Best Practice: Avoid by Redefining Success
• Repurpose - ML designed for 1 system may work well for another
• Reset expectations
• Relabel Bad Labels
37. The Uncanny Valley Doesn’t Always Exist
With the first UI, people completely ignored the ML suggestion
You could design a system no one uses to avoid the Valley
[Figure: example photos labeled ‘1 mug’ and ‘2 mugs’; system accuracy plotted against ML accuracy (terrible, human, perfect)]
38. Delivering Results in the Uncanny Valley Can Lead to Early Project Termination
• These are critical systems where a 5% drop in accuracy undoes years of research and investments. Mistakes are not treated kindly in these fields.
• Funding is much more tightly controlled and hard to obtain after failed launches.
• Legal modifications may become barriers
39. Call to Action:
Source a New Field of Research ‘ML Integration’
• HCI, Team Research, Data Science, AI, Psychology, User Research & Application Fields are all trying to understand integrating ML into Human Decision Systems … independently and slowly
• Binding efforts into a single discipline will rapidly increase development and possibly meet demand
* I am not qualified to give this talk … but who is?
40. Bootstrapping ML Integration
Funding Sources {Military, Accenture, Academia, …?}
Initial Areas of Research:
• How to calculate the speed and accuracy of large distributed human + ML decision systems
• How to safely train and roll out new decision processes to experts
• How to fairly explain an ML decision. Beyond explainable AI.
• How to design and run experiments in these systems
[Figure labels: Machine Learning, DS, AI, Politics, Team Research, Effective ML Integration]
41. Call to Action: Implement Guardrail Metrics for Vulnerable Members of the Population
• Guardrail Metrics are used to allow ML to optimize as much as it can within a specified business boundary
• Let’s define a boundary in tech to not make systems worse for black girls
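A guardrail check of this kind can be sketched as a filter applied before any optimization step. The sketch is an illustration under assumptions, not a standard API: each model is represented by a hypothetical dict of metric values keyed by group name, higher values are assumed to be better, and `passes_guardrail` and `choose_model` are made-up names.

```python
def passes_guardrail(baseline, candidate, guarded_groups, tolerance=0.0):
    """True if the candidate is no worse than the baseline for every guarded group.

    Assumption: `baseline` and `candidate` map group names to a metric
    where higher is better (for example, accuracy for that group).
    """
    return all(
        candidate[group] >= baseline[group] - tolerance
        for group in guarded_groups
    )


def choose_model(baseline, candidates, guarded_groups, objective="overall"):
    """Pick the best model on `objective` among those inside the guardrail.

    The baseline is always eligible, so a candidate that harms a guarded
    group can never replace it, however much it improves the objective.
    """
    allowed = [
        c for c in candidates
        if passes_guardrail(baseline, c, guarded_groups)
    ]
    return max([baseline] + allowed, key=lambda metrics: metrics[objective])
```

For example, with a guarded group key such as `"black_girls"`, a candidate that raises overall accuracy but lowers accuracy for that group is rejected, and the optimizer is free to pick among the remaining candidates.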