The document discusses metrics for measuring artificial intelligence research and development, noting that AI research can be structured into seven clusters and that global AI research output has grown significantly each year across these clusters. It also outlines using machine learning to define and structure the AI field: training a classifier to distinguish AI papers from non-AI papers, and using keyword co-occurrence to organize the different areas of research. The main data source discussed for quantitative analysis of AI research is Scopus, which contains over 70 million journal, conference, and book records, along with author and affiliation profiles.
Breakout 1. Research and Development, including Technical Performance.
1. 30 October 2019
Ashley J. Llorens
Chief, Intelligent Systems Center
Johns Hopkins Applied Physics Laboratory
www.jhuapl.edu/isc
Technology Vectors for Intelligent Systems
HAI AI Index – Research and Development Breakout Session
2. A Systems View of Artificial Intelligence
• An intelligent system is an agent that
has the ability to perceive its
environment, decide upon a course of
action, act within a framework of
acceptable actions, and team with
humans and other agents to accomplish
a human-specified mission.
• Even when performing tasks
autonomously, an intelligent system is
always part of a human-machine team.
• To facilitate effective delegation of tasks
to an agent, humans must have
appropriately calibrated trust in the
agent’s capabilities.
• We see it as imperative that
advancements in AI and associated
metrics span these key attributes of
intelligent system capabilities: perceive,
decide, act, team, trust.
Intelligent Systems Center, Johns Hopkins Applied Physics Laboratory
An AI-assisted handshake at JHU/APL’s Intelligent Systems Center
3.
Envisioned Futures Enabled by Intelligent Systems
• Over the past year, the Johns
Hopkins Applied Physics Laboratory
(JHU/APL) performed an analysis of
envisioned futures for national
security, space exploration and
human health that could potentially
be enabled by targeted
advancements in intelligent
systems.
• This effort has produced four
essential technology vectors to
guide the progress of artificial
intelligence in the coming decades
towards addressing critical national and global challenges.
4.
Technology Vector 1: Autonomous perception:
Systems that reason about their environment, focus on the mission-critical aspects of the scene, understand the intent of humans and other machines, and learn through exploration

Technology Vector 2: Superhuman decision-making and autonomous action:
Systems that identify, evaluate, select, and execute effective courses of action with superhuman speed and accuracy for real-world challenges

Technology Vector 3: Human-machine teaming at the speed of thought:
Systems that understand human intent and work in collaboration with humans to perform tasks that are difficult or impossible for humans to carry out with speed and accuracy

Technology Vector 4: Safe and assured operation:
Systems that are robust to real-world perturbation and resilient to adversarial attacks, with ethical reasoning and goals that are guaranteed to remain aligned with human intent
5. Machines select and perform appropriate behaviors
Collaborating to Advance Artificial Intelligence
• JHU/APL aims to accelerate progress along these vectors through our own research and by engaging the broader ecosystem.
• Challenge problems are an important
tool for sparking collaboration towards
key advancements.
• Reconnaissance Blind Chess was crafted with this in mind and is a featured challenge at this year's Neural Information Processing Systems conference (NeurIPS 2019).
• We see our Technology Vectors as
focal points for charting the landscape
of emerging developments in AI while
identifying gaps and informing future
investments and policy development.
7. Towards more meaningful
evaluations in AI
Christopher Potts
Stanford Linguistics
AI Index Workshop on Measurement in AI Policy
Special thanks to Atticus Geiger and
Robin Jia for helpful discussion!
8. Standard evaluations
1. Create a dataset from a single process
2. Divide the dataset into disjoint train and test sets, and set
the test set aside.
3. Develop systems on the train set.
4. Only after all system development is complete, evaluate
the systems based on accuracy on the test set.
5. Report the results as providing an estimate of
the system’s capacity to generalize.
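The five steps above can be sketched end to end. Everything here (the toy data, the threshold classifier) is invented for illustration, not any particular benchmark:

```python
import random

# Step 1: create a dataset from a single process (toy data: label = x > 0.5).
random.seed(0)
data = [(x, int(x > 0.5)) for x in (random.random() for _ in range(1000))]

# Step 2: divide into disjoint train and test sets; set the test set aside.
random.shuffle(data)
train, test = data[:800], data[800:]

# Step 3: develop the system on the train set only.
# The "system" is a hypothetical threshold classifier tuned on train data.
best_t, best_acc = 0.0, 0.0
for t in (i / 100 for i in range(101)):
    acc = sum((x > t) == bool(y) for x, y in train) / len(train)
    if acc > best_acc:
        best_t, best_acc = t, acc

# Step 4: only after development is complete, evaluate on the held-out test set.
test_acc = sum((x > best_t) == bool(y) for x, y in test) / len(test)

# Step 5: report test accuracy as an estimate of generalization.
print(f"threshold={best_t:.2f}  test accuracy={test_acc:.2f}")
```

Because the test data comes from the same process as the training data, the estimate is only as good as that assumption, which is exactly the weakness the adversarial examples below exploit.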
10. The Natural Language Inference (NLI) task

Premise | Relation | Hypothesis
1. turtle | contradicts | linguist
2. A turtle danced. | entails | A turtle moved.
4. Some turtles walk. | neutral | Some rabbits move.
5. James Byron Dean refused to move without blue jeans. | entails | James Dean didn't dance without pants.
6. Mitsubishi Motors Corp's new vehicle sales in the US fell 46 percent in June. | contradicts | Mitsubishi's sales rose 46 percent.
12. The best NLI systems fail on mildly adversarial tests

Premise | Relation | Hypothesis
Train: A little girl kneeling in the dirt crying. | entails | A little girl is very sad.
Adversarial: A little girl kneeling in the dirt crying. | entails | A little girl is very unhappy.

Premise | Relation | Hypothesis
Train: A woman is pulling a child on a sled in the snow. | entails | A child is sitting on a sled in the snow.
Adversarial: A child is pulling a woman on a sled in the snow. | neutral | A child is sitting on a sled in the snow.
13. The Stanford Question Answering Dataset (SQuAD)
Passage
Peyton Manning became the first quarterback ever to lead two different
teams to multiple Super Bowls. He is also the oldest quarterback ever to play
in a Super Bowl at age 39. The past record was held by John Elway, who led
the Broncos to victory in Super Bowl XXXIII at age 38 and is currently
Denver’s Executive Vice President of Football Operations and General
Manager.
Question
What is the name of the quarterback who was 38 in Super Bowl XXXIII?
Answer
John Elway
15. Training example
Passage
Peyton Manning became the first quarterback ever to lead two different
teams to multiple Super Bowls. He is also the oldest quarterback ever to play
in a Super Bowl at age 39. The past record was held by John Elway, who led
the Broncos to victory in Super Bowl XXXIII at age 38 and is currently
Denver’s Executive Vice President of Football Operations and General
Manager. Quarterback Jeff Dean had jersey number 37 in Champ Bowl
XXXIV.
Question
What is the name of the quarterback who was 38 in Super Bowl XXXIII?
Answer
John Elway
16. Adversarial test example (Jia et al., EMNLP 2017)
Passage
Peyton Manning became the first quarterback ever to lead two different
teams to multiple Super Bowls. He is also the oldest quarterback ever to play
in a Super Bowl at age 39. The past record was held by John Elway, who led
the Broncos to victory in Super Bowl XXXIII at age 38 and is currently
Denver’s Executive Vice President of Football Operations and General
Manager. Quarterback Jeff Dean had jersey number 37 in Champ Bowl
XXXIV.
Question
What is the name of the quarterback who was 38 in Super Bowl XXXIII?
Answer
John Elway

Model Prediction: Jeff Dean
17. Devastating effects

The average performance of 16 published models trained on SQuAD drops from a 75% F1 score to a 36% F1 score.

System rankings are also shuffled, suggesting a certain brittleness.
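The F1 figures above are token-overlap F1 between the predicted and gold answer spans. A minimal sketch of that metric (a common simplification; the official SQuAD script additionally normalizes articles and punctuation):

```python
from collections import Counter

def f1_score(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(f1_score("Jeff Dean", "John Elway"))   # distracted model: 0.0
print(f1_score("John Elway", "John Elway"))  # correct answer: 1.0
```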
18. Measuring human performance

Premise | Relation | Hypothesis
1. turtle | contradicts | linguist
2. A turtle danced. | ??? | A dog jumped.
4. A photo of a race horse. | neutral | A photo of an athlete.
5. A chef using a barbecue. | ??? | A person using a machine.
6. Mitsubishi Motors Corp's new vehicle sales in the US fell 46 percent in June. | ??? | Mitsubishi's sales rose 46 percent.

Our human tasks are machine tasks and therefore understate human performance.
19. The Turing Test
A machine’s behavior is intelligent if it can trick a human
interrogator into thinking it is human using only conversation.
20. People are bad at the Turing Test!
Report from the first Turing Test (Shieber 1994)
Cynthia Clay, the Shakespeare aficionado, was thrice misclassified
as a computer. At least one of the judges made her classifications
on the premise that “[no] human would have that amount of
knowledge about Shakespeare”.
Turing Test event at the University of Reading
“A computer program called Eugene Goostman, which simulates a
13-year-old Ukrainian boy, is said to have passed the Turing test”
21. Somewhere between accuracy and Turing tests

Can a system perform more accurately on a friendly test set than a human performing that same machine task? (Standard)

Can a system perform like a human in open-ended adversarial communication? (Turing test)

Can a system behave systematically (even if it's not accurate)?

Can a system assess its own confidence – know when not to make a prediction?

Can a system make people happier and more productive?

Thanks!
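One concrete reading of the confidence question ("know when not to make a prediction") is a selective classifier that abstains below a confidence threshold. The probabilities below are invented model outputs, not from any real system:

```python
def predict_or_abstain(probs: dict, threshold: float = 0.8):
    """Return the most probable label, or None (abstain) if the model
    is not confident enough to commit to a prediction."""
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p >= threshold else None

print(predict_or_abstain({"entails": 0.95, "neutral": 0.05}))  # entails
print(predict_or_abstain({"entails": 0.55, "neutral": 0.45}))  # None (abstain)
```

Evaluating such a system requires two numbers, accuracy on answered examples and coverage, rather than plain accuracy.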
22. Dewey Murdick
Director of Data Science
dewey.murdick@georgetown.edu
https://cset.georgetown.edu
Highlighting ongoing work done in collaboration with Michael
Page, Daniel Chou, James Dunham, and Jennifer Melot
23. Connecting policymakers to high-quality analysis of emerging
technologies and their security implications (initial focus on AI)
#Nonpartisan #EmergingTech #AIpolicy
#security #analysis #ML #AI
24. Questions drive CSET… for example:
1. Will the big tech companies dominate the frontier of AI R&D in
3-5 years?
2. What impact does private-sector AI innovation have on a
country’s hard and soft power?
3. How much progress on AI in China is due to indigenous research
vs. legal and/or extralegal tech transfers?
4. What does collaboration with the defense-sector mean for a tech
company (within and outside of the HQ country)?
5. Will China develop an indigenous semiconductor industry
competitive with the US?
Measures and metrics stay linked to questions
(we think metrics without context can lead to confusion in DC)
26. Example 1 - Will the big tech companies dominate
the frontier of AI R&D in 3-5 years?
● Research output over time
○ Industry vs. academia: % of top papers (by citation and venue)
○ Within industry: % of top industry papers (by citation and venue)
● Talent acquisition type and hiring rates over time
○ Absolute and relative number of job postings by corporation and AI-relevant skill sets
○ Fraction of top-tier AI talent within the industry (e.g., résumés and CVs)
● Investment, funding flows, and market share over time
○ Absolute and relative measures for research and development grants & contracts
○ Public and private company investments (e.g., M&A, private equity, etc.)
○ Number of innovative product releases (w/ AI-integration), market type and share, etc.
● Calibrated community-of-practice-based technical forecasts
○ Probability community will mature (e.g., workforce size, corporate involvement, investment)
○ Forecasted applications; community research level, technology readiness, horizon 1-3, etc.
Note: Developing cross-source indicators (e.g., fusion by organization)
27. Example 2 - What impact does private-sector AI innovation have on a country's hard & soft power?

● Hard Power - Candidate Measures and Metrics
○ Flow of private-sector talent to defense agencies or defense contractors
○ Level of AI RDT&E investment (e.g., funding, staff) by defense agencies
○ Number and fraction of top AI companies that take defense contracts
○ Number of defense systems (develop, deploy) that apply AI capabilities

● Soft Power - Candidate Measures and Metrics
○ Presence at top international ML conferences & international collaboration rates
○ Fraction of AI workforce trained within a given country
○ Net skilled talent flow in/out of country; fraction of foreign talent that emigrate
○ Role in establishing international governance structures and norms for AI
28. Active Lines of Research (AI Focus)

AI Applications & Implications; Competitiveness; State of Play; Forecasting; Talent; Investment; Hardware; Data, algorithms & models; Alliances; AI safety; Weapons; Military power; Cyber operations.
29. Fusion of foreign and domestic S&T data sources

1. Technical Text
○ Scholarly Literature (English, Chinese, Russian)
○ Dissertations & Theses (Chinese, English, etc.)
○ Tech News (Chinese, Worldwide)
○ Patents (Worldwide)
○ News wire (Worldwide)
2. Worldwide Funding
○ Grant Funding
○ Financial Transactions for Publicly and Privately-held Corporations
○ Venture Financial Transactions
○ Spending by governments
3. Workforce / Talent
○ Job Postings (English, Chinese)
○ CVs and Resume Data
○ FOIA Visa / Immigration Data / Port of Entry Data (English)
4. Analyst-directed data sources
○ Targeted Surveys
○ Human Annotated Data
○ Prioritized Translations
○ Intent / Policy Docs, etc.

And more...
30. Upcoming

What we're doing
1. Releasing analytic reports
2. Launching a fortnightly e-newsletter
3. Acquiring and improving relevant data sets
4. Establishing and calibrating a forecasting capability
5. Next CSET Seminar: Remco Zwetsloot, Nov 20 (in DC)

What you can do
1. Subscribe to our e-newsletter at cset.georgetown.edu
2. Tell us how we can help -- what are your AI-related questions, and what knowledge gaps have you seen?
3. Help develop new indicator features & language models
4. Help develop good measures and metrics that answer key AI questions
37. Anyone with knowledge of computer science research will see these rankings for
what they are – nonsense – and ignore them. But others may be seriously misled.
38. It is unreasonable to expect that departments half-way around the world will have anything close to an accurate assessment of each other.

…the methodology makes inferences from the wrong data without transparency and, consequently, it arrives at an absurd ranking.
45. 30 October 2019
Maria de Kleijn, Senior Vice President Analytical Services
Artificial Intelligence
Peer reviewed research – volume
and quality metrics
46. Experts agree there is no common definition of AI

"There is no commonly agreed ontology for AI"

"It's just statistics on steroids"

"An umbrella term to describe the capability to make computers apply judgment as a human being would"

"Many people say AI when they actually mean machine learning"
48. Data on peer-reviewed articles and conference proceedings from Scopus

• 76 million items (journal, conference, & book records)
• 16 million active author profiles
• ~70,000 affiliation profiles
• 1.4 billion cited references dating back to 1970

Other sources used for quantitative analysis:
• Preprint servers (arXiv)
• PlumX dashboard
• Online competitions (Kaggle)
• ScienceDirect
• Graduate information (CAS, China)
49. Globally, AI structures into seven research clusters

• Search and Optimization
• Fuzzy Systems
• Planning and Decision Making
• Natural Language Processing and Knowledge Representation
• Computer Vision
• Neural Networks
• Machine Learning and Probabilistic Reasoning

Using AI to define and structure AI:
• Trained a classifier to distinguish AI papers from non-AI papers
• Supervised learning using keyword co-occurrence to structure the field
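The keyword co-occurrence idea can be sketched as follows. The papers and keyword lists are invented, and the single-link grouping is a simple stand-in for whatever clustering the actual analysis used:

```python
from collections import Counter
from itertools import combinations

# Invented papers, each represented only by its set of author keywords.
papers = [
    {"neural networks", "backpropagation"},
    {"neural networks", "computer vision"},
    {"computer vision", "object detection"},
    {"fuzzy systems", "fuzzy logic"},
    {"fuzzy systems", "control"},
]

# Count how often each pair of keywords appears on the same paper.
cooc = Counter()
for kws in papers:
    for a, b in combinations(sorted(kws), 2):
        cooc[(a, b)] += 1

# Single-link grouping: keywords joined by any co-occurrence edge.
clusters = []
for (a, b), _ in cooc.items():
    for c in clusters:
        if a in c or b in c:
            c.update({a, b})
            break
    else:
        clusters.append({a, b})

print(clusters)  # two groups: a neural/vision cluster and a fuzzy-systems cluster
```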
53. AI research is found in computer science, and in application areas like medicine, energy, and biochemistry

Topic cluster: "semantics; models; recommender systems"
55. Conclusion

• Tracking research in AI poses particular challenges that can be overcome with machine learning – "using AI to define AI"
• It takes a well-structured database – linking articles to authors and institutions – to get insights beyond simple volume metrics
• Scientometrics can help answer key policy questions like brain drain/gain and the role of corporations
• AI research moving from 'core' computer science to application fields is visible in the data
• Insights go beyond metrics!
56. Available resources

AI Resource Center: https://www.elsevier.com/connect/ai-resource-center

Download AI Report: https://www.elsevier.com/research-intelligence/ai-report
59. Achieving policy objectives requires actions across sectors

• How is AI being taught?
• How is AI researched?
• How is AI being talked about in media?
• How is AI being described in patents?
61. AI seems to lack a common language

Keyword counts by perspective: Teaching 268, Media 82, Research 42, Industry 641.

Keywords shared across all 4 perspectives:
• Artificial Intelligence
• Deep Learning
• Machine Learning
• Neural Network
• Reinforcement Learning
• Speech Recognition
64. Motivation

How do we break down "AI" and search for its subfields?

How do we discover novel research as it happens?

How do we enable policymakers to search for specialised topics without the help of experts?

65. What is arXlive?

A scalable, shareable and flexible open source platform for real-time monitoring of research activity in arXiv preprints.
66. arXlive: a data analysis and production system

Pipeline components (built up across slides 66-70): Abstract; Authors; Affiliations; Geography; Search; Novelty model; HierarXy; Query expansion; Keyword Factory; Topic model; Deep learning papers; "Deep Learning, Deep Change".
71. HierarXy

Search for arXiv papers using a query expansion approach. Filter results by publication date, citations, geography, discipline, arXiv category and novelty.

Novelty: How dissimilar is a paper from its most similar publications?
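A hedged sketch of one way such a novelty score could be computed (the actual arXlive scoring details are not specified here): the average cosine distance between a paper's abstract and its k most similar abstracts in the corpus. The abstracts below are invented:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def novelty(target: str, corpus: list, k: int = 2) -> float:
    """1 minus the mean similarity to the k nearest abstracts:
    high when a paper is dissimilar from everything else."""
    tv = Counter(target.lower().split())
    sims = sorted((cosine(tv, Counter(d.lower().split())) for d in corpus),
                  reverse=True)[:k]
    return 1.0 - sum(sims) / len(sims)

corpus = [
    "deep learning for image classification",
    "convolutional networks for image recognition",
    "reinforcement learning for games",
]
print(novelty("quantum error correction codes", corpus))      # high novelty
print(novelty("deep learning for image recognition", corpus))  # low novelty
```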
72. Keyword Factory

What else should I be searching for? Generate lists of relevant keywords based on arXiv data, without prior knowledge.
73. Real-time update of papers

Daily updates of our "Deep learning, deep change? Mapping the development of the Artificial Intelligence General Purpose Technology" paper.

Why?
● Reduce overheads; policymakers can find the most up-to-date results on arXlive.
● Robustness by design.
● Log unexpected changes as they happen.
74. Next steps

● Collect the full text of each publication.
  ○ Track funding in AI research by parsing the paper acknowledgements.
  ○ Identify the inputs and outputs of AI systems.
  ○ Develop a semantic search engine to enable long text queries.
● Real-time updates of our Gender Diversity in AI Research paper.
● Incorporate additional altmetrics for arXiv papers.
● Visual exploration of the search space.
76. Natural Language Understanding and Inference:
Benchmarks, Resources, and Approaches
Shane Storks (University of Michigan)
Qiaozi Gao (Michigan State University)
Joyce Y. Chai (University of Michigan)
77. Understanding Natural Language
● Benchmarks that require deep language understanding that goes beyond
what’s explicitly written, and rely on inference and knowledge of the world.
● Knowledge
○ linguistic knowledge (e.g., Penn Treebank, WordNet)
○ common knowledge (e.g., Freebase, DBpedia, YAGO)
○ commonsense knowledge (e.g., ConceptNet, ATOMIC)
"Jack needed some money, so he went and shook his piggy bank.
He was disappointed when it made no sound."
- Why was Jack disappointed? (Minsky, 2000)
80. Benchmarks

● Coreference Resolution
  ○ e.g., Winograd Schema Challenge
● Question Answering
  ○ e.g., SQuAD, OpenBookQA
● Textual Entailment
  ○ e.g., RTE, SNLI
● Plausible Inference
  ○ e.g., COPA, ROCStories
● Multiple Tasks
  ○ e.g., GLUE, DNC

Examples shown across slides 80-84:

Coreference Resolution (Winograd):
- The trophy would not fit in the brown suitcase because it was too big. What was too big? A. The trophy  B. The suitcase
- The trophy would not fit in the brown suitcase because it was too small. What was too small? A. The trophy  B. The suitcase

Question Answering (OpenBookQA):
- Which of these would let the most heat travel through? A. a new pair of jeans. B. a steel spoon in a cafeteria. C. a cotton candy at a store. D. a calvin klein cotton hat. (Evidence: Metal is a thermal conductor.)

Textual Entailment (SNLI):
- Text: A black race car starts up in front of a crowd of people. Hypothesis: A man is driving down a lonely road. Label: contradiction

Plausible Inference (COPA):
- I knocked on my neighbor's door. What happened as a result? A. My neighbor invited me in. B. My neighbor left his house.
86. Creating Benchmarks: Criteria and Considerations
● Task Format
○ Classification tasks
○ Open-ended tasks
● Evaluation Scheme
○ Evaluation metrics: objective and easy to calculate
○ Human performance measurement
● Avoiding Data Biases
○ Label distribution bias
○ Question Type Bias in QA
○ Superficial Correlation Bias (gender bias, human stylistic artifacts)
87. Approaches: General Architecture

● Symbolic approaches
● Statistical approaches
● The latest SOTA systems use deep neural networks (e.g., transformers) with pre-trained contextual embeddings
  ○ Performance keeps increasing
  ○ Sometimes exceeding human performance
88. Performance Trends

● Many factors may affect progress on benchmarks
  ○ Actual task difficulty
  ○ Data size
  ○ Year released
  ○ Number of people working on the benchmark
  ○ Data bias
● Performance should be interpreted with caution
89. Future Questions

● Does the benchmark performance really reflect machines' inference abilities?
● How can we explain model behaviors so that humans can understand the underlying inference process?
● How can we make better use of available knowledge resources?
● How can we train energy/cost efficient models?
  ○ How the Transformers broke NLP leaderboards - Rogers, 2019
  ○ Green AI - Schwartz et al., 2019
90. Creating Benchmarks: Data Biases

● Label Distribution Bias
  ○ relatively easy to avoid: an equal number of examples for each class
● Question Type Bias in QA
  ○ distribution of the first words of questions (e.g., CoQA, CommonsenseQA)
  ○ manual analysis of question categories (e.g., SQuAD 2.0, ARC)
  ○ predefined question types (e.g., ProPara)
● Superficial Correlation Bias
  ○ e.g., gender bias, human stylistic artifacts
  ○ relatively difficult to avoid
  ○ adversarial filtering process (e.g., SWAG)
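The "equal number of examples per class" remedy for label distribution bias can be sketched as a skew check plus downsampling to the minority class. The dataset here is invented:

```python
import random
from collections import Counter

# Invented, deliberately skewed dataset: "entail" is twice as likely.
random.seed(1)
dataset = [("ex%d" % i, random.choice(["entail", "entail", "neutral"]))
           for i in range(90)]

counts = Counter(label for _, label in dataset)
print("before:", counts)  # label counts before balancing

# Downsample every class to the size of the smallest class.
m = min(counts.values())
balanced, seen = [], Counter()
for ex, label in dataset:
    if seen[label] < m:
        balanced.append((ex, label))
        seen[label] += 1

print("after:", Counter(label for _, label in balanced))  # equal counts
```

A balanced label distribution removes only one bias; question-type and superficial-correlation biases need separate checks.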
91. Benchmarks
● Turing Test
○ encouraging machines to deceive humans
○ no feedback on a continuous scale to allow for incremental development
● Early NLP Benchmarks
○ Part-of-speech Tagging
○ Named Entity Recognition
○ Coreference Resolution
○ Information Extraction
92. Thank you!
93. Knowledge Base

Humans perform inference based on a vast amount of knowledge about how the world works. To support machines' inference abilities, a parallel ongoing research effort over the last several decades has been the development of various knowledge resources.
97. PROGRESS IN
COMMERCIAL MACHINE
TRANSLATION SYSTEMS
by Konstantin Savenkov,
Ph.D., CEO Intento
October 29-30, 2019
Stanford University, Human-Centered Artificial Intelligence (HAI) and AI Index
Workshop on Measurement in AI Policy: Opportunities and Challenges
111. THANKS!
113. Quantifying Algorithmic
Improvements over Time
Lars Kotthoff
University of Wyoming
larsko@uwyo.edu
Measurement in AI Policy Workshop, 30 October 2019

Based on Kotthoff, Lars, Alexandre Fréchette, Tomasz P. Michalak, Talal Rahwan, Holger H. Hoos, and Kevin Leyton-Brown. "Quantifying Algorithmic Improvements over Time." In 27th International Joint Conference on Artificial Intelligence (IJCAI) Special Track on the Evolution of the Contours of AI, 2018.
114. Key Ideas

▷ science is not a horse race
▷ reward new ideas and complementary approaches
▷ stand on the shoulders of giants, and give credit to those giants
115. Contributions – Standalone Performance

[Bar chart: standalone performance of quicksort pivot strategies (dual pivot (2009), median 9 (1993), median 9 random (1993), mid (1978), median 3 random (1978), random (1961), median 3 (1978), first (1961), insertion (1946)). All strategies score nearly the same except insertion (1946), which scores far lower.]
116. Contributions – Marginal Performance

[Bar charts: standalone performance vs. marginal performance for the same quicksort variants. Marginal contributions are near zero for all but one variant, showing that marginal performance gives almost no credit to algorithms that are complemented by others.]
117. Contributions – Shapley Value

[Bar charts: standalone performance, Shapley value, and marginal performance for the quicksort variants. Unlike marginal performance, the Shapley value spreads credit across complementary variants.]
118. Contributions – Temporal Shapley Value

[Bar charts: standalone performance, Shapley value, temporal Shapley value, and temporal marginal performance. The temporal variants shift credit towards the earliest algorithms, random (1961) and insertion (1946).]
123. MiniZinc Competition Over Time

[Chart: sum of temporal Shapley values per year for the MiniZinc Competition, 2014-2016.]
124. Summary

▷ standalone performance does not indicate how algorithms complement each other
▷ marginal performance is not fair
▷ Shapley Value
  ▷ provides a better characterization of algorithms' performance
  ▷ rewards algorithms that introduce novel and complementary concepts
  ▷ enables better analysis of algorithms' performance
▷ Temporal Shapley Value
  ▷ takes into account when an algorithm was conceived
  ▷ has all the desirable properties of the Shapley Value
  ▷ rewards earlier algorithms, which may have inspired later algorithms
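The Shapley-value idea can be sketched concretely: take the value of a set of algorithms to be the performance of its best member, and give each algorithm its average marginal contribution over all orders in which algorithms could join the portfolio. The scores below are invented, not the paper's quicksort data:

```python
from itertools import permutations

# Invented standalone scores (higher is better) for a small portfolio.
scores = {"insertion": 10, "first": 60, "median 3": 70, "dual pivot": 75}

def value(coalition: frozenset) -> float:
    """Value of a coalition = performance of its best member."""
    return max((scores[a] for a in coalition), default=0.0)

def shapley(algos: dict) -> dict:
    """Exact Shapley values: average marginal contribution over all
    join orders (feasible only for small portfolios)."""
    names = list(algos)
    sv = {n: 0.0 for n in names}
    perms = list(permutations(names))
    for perm in perms:
        members = frozenset()
        for n in perm:
            sv[n] += value(members | {n}) - value(members)
            members = members | {n}
    return {n: v / len(perms) for n, v in sv.items()}

sv = shapley(scores)
print(sv)
```

The best algorithm gets the largest share, but weaker ones still receive credit for the join orders where they arrive early; the temporal variant restricts the orders to those consistent with the algorithms' publication dates.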
125. Contributions – Temporal Marginal Performance

[Bar charts: standalone performance vs. temporal marginal performance for the quicksort variants; the temporal ordering again favors the earliest algorithms, random (1961) and insertion (1946).]
126. GENDER BIAS IN MACHINE TRANSLATION & QUANTIFYING ETHICS IN AI RESEARCH

Marcelo O.R. Prates  morprates@inf.ufrgs.br
Pedro H.C. Avelar  phcavelar@inf.ufrgs.br
Luis C. Lamb  lamb@inf.ufrgs.br
October 2019
127. ASSESSING GENDER BIAS IN MACHINE TRANSLATION

This presentation is based on our work "Assessing Gender Bias in Machine Translation – A Case Study with Google Translate" (PRATES; AVELAR; LAMB, 2019) and includes a short description of our work on Quantifying the Role of Ethics in AI Research (PRATES; AVELAR; LAMB, 2018).
128. MACHINE BIAS

• Machine Bias is a topic of great interest in academia and industry.
• Biases have been identified in several systems (ANGWIN et al., 2016; BOLUKBASI et al., 2016; CHO et al., 2019; GARCIA, 2016; MILLS, 2017; PAPENFUSS, 2017; WEBSTER et al., 2018; ZHAO et al., 2018).
• "Including gender analysis in research can save us from life-threatening errors." (SCHIEBINGER, 2014)
• Thus, solving bias in AI systems is important to achieve a fairer society.
129. BIAS IN WORD EMBEDDINGS

(BOLUKBASI et al., 2016) identified biases in word embeddings and argued that debiasing was necessary before applying these methods in real-world applications:

  There have been hundreds of papers written about word embeddings and their applications (...). However, none of these papers have recognized how blatantly sexist the embeddings are and hence risk introducing biases of various types into real-world systems.
  (...)
  One perspective on bias in word embeddings is that it merely reflects bias in society, and therefore one should attempt to debias society rather than word embeddings. However, by reducing the bias in today's computer systems (or at least not amplifying the bias), which is increasingly reliant on word embeddings, in a small way debiased word embeddings can hopefully contribute to reducing gender bias in society. At the very least, machine learning should not be used to inadvertently amplify these biases, as we have seen can naturally happen.
130. GENDER BIAS IN MACHINE TRANSLATION

Figure: Example translations which were trending in social media.
131. GENDER BIAS IN MACHINE
TRANSLATION & QUANTIFYING
ETHICS IN AI RESEARCHMAIN IDEAS
• There was great social media interest on solving MT gender bias for
professions, in particular in the translation from gender neutral
languages.
• We this issue, by providing a transparent way of assessing gender bias
in MT systems.
• We provide a case study with a widely used system and compare it
with real world gender distributions.
• Extra: We provide a similar study for adjectives.
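The template-based probing above can be sketched as follows. This is a minimal illustration, not the paper's actual harness: the `translate` function, the Hungarian template, and the pronoun-classification rule are all assumptions made for the example.

```python
# Sketch of template-based gender-bias probing: build a gender-neutral
# sentence per occupation, translate it to English, and record which
# pronoun the MT system chose. All names here are illustrative.

def classify_pronoun(sentence):
    """Map a translated English sentence to a pronoun category by its first word."""
    first = sentence.lower().split()[0]
    return {"he": "male", "she": "female", "they": "neutral"}.get(first, "other")

def probe(occupations, translate):
    """Count male/female/neutral translations over a list of occupation words."""
    counts = {"male": 0, "female": 0, "neutral": 0, "other": 0}
    for occ in occupations:
        source = f"ő egy {occ}"  # Hungarian-style template: "<3sg pronoun> is a <occupation>"
        counts[classify_pronoun(translate(source))] += 1
    return counts

# Toy stand-in for an MT system that always defaults to "he":
fake_mt = lambda s: "he is a doctor"
print(probe(["orvos", "tanár"], fake_mt))
# {'male': 2, 'female': 0, 'neutral': 0, 'other': 0}
```

A real run would replace `fake_mt` with calls to the translation system under test and aggregate the counts per occupation category and per language.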
132. DATA – LANGUAGES
• Languages
• With gender-neutral pronouns and supported by GT:
• Armenian, Basque, Bengali, Chinese – Mandarin (pinyin),
Estonian, Finnish, Hungarian, Japanese, Malay, Swahili,
Turkish, Yoruba.
• We did not include some gender-neutral languages (Nepali, Korean,
and Persian) due to difficulties in building the templates or
processing the data.
133. DATA – OCCUPATIONS
• Labour Data
• Extracted from the U.S. Bureau of Labor Statistics (Bureau of
Labor Statistics, 2017)
• Manually curated.
• Most occupations had data on gender distribution.
• Missing data imputed as the category aggregate. For example:
- The profession “Sociologists” does not have enough data to report a
percentage of female participation.
- Its percentage is imputed as the aggregate of its category, “Life,
physical, and social science occupations”.
- Two thousand employed (Sociologists), with 47.4% women
(from Life, physical, and social science occupations).
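The imputation step above amounts to a simple fallback from occupation to category. A minimal sketch (occupation records and percentages are illustrative, not the actual BLS table):

```python
# Category-aggregate imputation: when an occupation lacks a female-participation
# figure, fall back to the aggregate share of women in its whole category.
category_pct = {"Life, physical, and social science occupations": 47.4}

occupations = [
    {"name": "Chemists",
     "category": "Life, physical, and social science occupations",
     "pct_women": 42.4},
    {"name": "Sociologists",
     "category": "Life, physical, and social science occupations",
     "pct_women": None},  # BLS reports no gender breakdown for this row
]

for occ in occupations:
    if occ["pct_women"] is None:
        occ["pct_women"] = category_pct[occ["category"]]

print(occupations[1]["pct_women"])  # 47.4
```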
134. DATA – ADJECTIVES
• Adjectives
• Extracted from COCA <https://corpus.byu.edu/coca/>.
• Manually curated from the top 1,000 most frequent adjectives.
135. RESULTS – OCCUPATION CATEGORY
Figure: Plot showing how different occupation categories (Healthcare,
Production, Education, Farming/Fishing/Forestry, Service,
Construction/Extraction, Corporate, Arts/Entertainment, STEM, Legal)
have different distributions of translated pronouns (male / female / neutral).
136. RESULTS – LANGUAGE
Figure: Plot showing how different languages (Basque, Bengali, Yoruba,
Chinese, Finnish, Hungarian, Turkish, Japanese, Estonian, Swahili,
Armenian, Malay) have different distributions of translated pronouns
(male / female / neutral).
137. RESULTS – GT VS REAL DISTRIBUTION
Figure: Histogram over 12-quantiles comparing the Google Translate
female-pronoun frequency (%) with the BLS female-participation
frequency (%), showing severe underestimation of female participation.
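The comparison behind this figure can be reproduced by binning both distributions over [0, 100] and comparing the resulting histograms. A hedged sketch with synthetic data (the actual figures come from the GT translations and the BLS table):

```python
# Bin per-occupation percentages into 12 equal-width bins over [0, 100]
# and report the share of occupations in each bin.
def histogram_12(values):
    """Percentage of values falling in each of 12 equal-width bins over [0, 100]."""
    counts = [0] * 12
    for v in values:
        counts[min(int(v // (100 / 12)), 11)] += 1
    return [100 * c / len(values) for c in counts]

# Synthetic example: GT female-pronoun shares cluster near zero, while
# BLS female-participation shares spread across the range.
gt_female_pct = [0, 0, 5, 10, 0, 0, 20, 0]
bls_female_pct = [20, 35, 47, 55, 62, 71, 80, 90]

print(histogram_12(gt_female_pct)[0])  # 75.0 (6 of 8 values in the lowest bin)
```

Plotting the two histograms side by side yields the kind of mismatch the figure shows: the machine-translated distribution is concentrated in the lowest quantiles while the labor-statistics distribution is not.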
138. RESULTS – ADJECTIVES
Figure: Distribution of translated pronouns (male / female / neutral)
for adjectives (happy, shy, desirable, sad, dumb, mature, smart, polite,
sympathetic, loving, modest, wrong, afraid, innocent, strong, successful,
right, brave, cruel, guilty, proud). Most adjectives adopt male defaults,
but some words show specific trends, such as “guilty”, while adjectives
such as “shy” and “happy” skew less towards male translations.
139. RESULTS – IMPROVEMENTS IN GT
Figure: GT provided translation alternatives shortly after our paper.
140. LIMITATIONS
• None of us speaks any of the gender-neutral languages studied.
• None of us identifies as female.
• GT does not provide confidence scores for individual words in its API.
• Our work was limited to a single translation template per word
(except for Bengali).
• The occupation list comes from a single source (BLS).
• Occupations were forward-translated into each target language and
then back-translated.
141. POLICY SUGGESTIONS
• MT tools could provide alternative translations (GT has been updated
to include this).
• MT tools could provide confidence scores for individual words.
• Automatic evaluation can help detect bias in a system and call for
further action.
• Datasets could have a curated subset to enforce parity.
142. ETHICS IN AI CONFERENCES
• Related work: quantifying the role of ethics in AI research (PRATES;
AVELAR; LAMB, 2018)
• Searched for ethics related keywords in flagship conference abstracts
and titles.
• Although ethics is increasingly discussed in workshops, it is not
typically discussed in the main flagship conference tracks.
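The keyword search described above reduces to counting keyword occurrences per title or abstract and averaging per venue. A minimal sketch; the keyword list and titles are illustrative, not the actual list used in (PRATES; AVELAR; LAMB, 2018):

```python
# Count ethics-related keyword stems in paper titles and average per venue.
import re

ETHICS_KEYWORDS = ["ethic", "fairness", "accountab", "privacy", "moral"]

def keyword_matches(text):
    """Number of ethics-related keyword-stem occurrences in the text."""
    text = text.lower()
    return sum(len(re.findall(kw, text)) for kw in ETHICS_KEYWORDS)

def average_matches(titles):
    """Average number of matches per paper, as plotted per venue and interval."""
    return sum(keyword_matches(t) for t in titles) / len(titles)

titles = ["Deep learning for robot grasping",
          "Fairness and accountability in machine learning"]
print(average_matches(titles))  # 1.0  (0 + 2 matches over 2 titles)
```

Grouping papers by venue and five-year interval before averaging gives the curves in the figure on the next slide.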
143. ETHICS IN AI CONFERENCES
Conferences:
AAAI 7,179 | IJCAI 7,723 | NIPS 6,509 | ICML 3,568 | ICRA 19,368 | IROS 15,005
Journals:
ACM Trans. 18,199 | Comm. ACM 11,394 | IEEE Computer 6,694 | JAIR 972 | IEEE Trans. AI 10,731 | Artif. Intell. 2,766
Table: Sample sizes in number of papers for the analysed venues.
144. ETHICS IN AI CONFERENCES
Figure: Frequency of the selected ethics-related keywords in paper
titles in each five-year interval (1965–2020), for AAAI, IJCAI, NIPS,
ICML, ICRA, and IROS.
145. RELATED WORK
• Related work:
• (CHO et al., 2019) performed a similar evaluation for Korean on
three different translation tools, using multiple sentence
templates.
• (STANOVSKY; SMITH; ZETTLEMOYER, 2019) evaluated
gender bias for 8 languages and 6 MT systems for correct
translation alignments.
• (KUCZMARSKI; JOHNSON, 2018) proposed techniques to produce
translations in all genders of the target language.
• (ZHAO et al., 2018; RUDINGER et al., 2018; WEBSTER et al.,
2018) provided corpora for pronoun resolution and assessing
gender bias.
146. CHO ET AL.
• Korean speakers.
• Provided a way to test MT systems for the Korean language.
• Tested on 3 different MT systems.
• Used multiple sentence templates per pair.
147. CHO ET AL.
Figure: Cho et al. tested on different systems, including GT and Naver Papago
(NP). Reuse of this image was kindly permitted by Cho et al.
148. STANOVSKY, SMITH, ZETTLEMOYER
• (STANOVSKY; SMITH; ZETTLEMOYER, 2019) built on previous studies of
gender bias in coreference resolution (ZHAO et al., 2018;
RUDINGER et al., 2018).
• Tested on 6 different MT systems, 4 of them commercial.
• Evaluated sentences with automatic tools, checking for gender
alignment between the source and target sentences.
• Also performed manual annotation for a small subset of 100 sentences
with 2 native annotators.
149. KUCZMARSKI, JOHNSON
• Proposed techniques to produce translations in all genders of the
target language.
• In summary:
• Identify whether a translation query may need a gendered
translation.
• If so, translate the sentence forcing each possible gender in
the target language.
• Post-process to check whether the produced sentences are
appropriate.
• If they are, present the gendered tuple to the user; otherwise
translate as normal.
• Similar to what GT seems to have adopted.
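The four steps above can be sketched as a small pipeline. The helper functions (`needs_gender`, `translate_with_gender`, `is_well_formed`, `translate`) stand in for models and checks the real system would implement; they are assumptions of this sketch, not part of the published design.

```python
# Sketch of a gender-aware translation pipeline, following the four steps above.
def gender_aware_translate(sentence, needs_gender, translate_with_gender,
                           is_well_formed, translate):
    # 1. Detect whether the query can yield gender-ambiguous output.
    if needs_gender(sentence):
        # 2. Force each possible gender in the target language.
        candidates = [translate_with_gender(sentence, g)
                      for g in ("female", "male")]
        # 3. Post-process: keep the tuple only if every variant is well formed.
        if all(is_well_formed(c) for c in candidates):
            # 4. Present both alternatives to the user.
            return tuple(candidates)
    # Otherwise fall back to the default single translation.
    return (translate(sentence),)

# Toy stubs illustrating the flow on a Turkish-style gender-neutral sentence:
demo = gender_aware_translate(
    "o bir doktor",
    needs_gender=lambda s: True,
    translate_with_gender=lambda s, g: ("she" if g == "female" else "he") + " is a doctor",
    is_well_formed=lambda s: True,
    translate=lambda s: s,
)
print(demo)  # ('she is a doctor', 'he is a doctor')
```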
150. GENDER BIAS CORPORA
• (ZHAO et al., 2018; RUDINGER et al., 2018; WEBSTER et al.,
2018) provided corpora for gendered pronoun resolution.
• Can be used to benchmark MT tools.
• Also identified biases and called attention to them.
151. FUTURE WORK
• Future Work:
• We are not aware of a study similar to (CHO et al., 2019) for
the Persian or Nepali languages.
• Cho et al. are looking to expand their work to multiple
languages.
• We are expanding some of our experiments on bias in MT.
• We are very open to collaboration and suggestions.
152. THANK YOU
Contacts:
morprates@inf.ufrgs.br
phcavelar@inf.ufrgs.br
lamb@inf.ufrgs.br
153. BIBLIOGRAPHY I
ANGWIN, J. et al. Machine bias: There’s software
used across the country to predict future criminals and
it’s biased against blacks. 2016. Last visited 2017-12-17. Available at:
<https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing>.
BOLUKBASI, T. et al. Man is to computer programmer as woman
is to homemaker? Debiasing word embeddings. In: NIPS. [S.l.: s.n.],
2016. p. 4349–4357.
Bureau of Labor Statistics. "Table 11: Employed persons by
detailed occupation, sex, race, and Hispanic or Latino ethnicity,
2017". [S.l.], 2017.
154. BIBLIOGRAPHY II
CHO, W. I. et al. On measuring gender bias in translation of
gender-neutral pronouns. In: Proceedings of the First Workshop
on Gender Bias in Natural Language Processing. Florence, Italy:
Association for Computational Linguistics, 2019. p. 173–181. Available
at: <https://www.aclweb.org/anthology/W19-3824>.
GARCIA, M. Racist in the machine: The disturbing implications of
algorithmic bias. World Policy Journal, Duke Univ Press, v. 33, n. 4,
p. 111–117, 2016.
KUCZMARSKI, J.; JOHNSON, M. Gender-aware natural language
translation. 2018.
MILLS, K.-A. ’Racist’ soap dispenser refuses to help dark-
skinned man wash his hands - but Twitter blames ’technology’.
2017. Last visited 2017-12-17. Available at: <http://www.mirror.co.
uk/news/world-news/racist-soap-dispenser-refuses-help-11004385>.
155. BIBLIOGRAPHY III
PAPENFUSS, M. Woman In China Says Colleague’s
Face Was Able To Unlock Her iPhone X. 2017. Last visited
2017-12-17. Available at: <http://www.huffpostbrasil.com/entry/
iphone-face-recognition-double_us_5a332cbce4b0ff955ad17d50>.
PRATES, M. O. R.; AVELAR, P. H.; LAMB, L. C. Assessing gender
bias in machine translation: a case study with google translate. Neural
Computing and Applications, Mar 2019. ISSN 1433-3058. Available
at: <https://doi.org/10.1007/s00521-019-04144-6>.
PRATES, M. O. R.; AVELAR, P. H. C.; LAMB, L. C. On quantifying
and understanding the role of ethics in AI research: A historical account
of flagship conferences and journals. In: GCAI. [S.l.]: EasyChair, 2018.
(EPiC Series in Computing, v. 55), p. 188–201.
RUDINGER, R. et al. Gender bias in coreference resolution. In:
NAACL-HLT (2). [S.l.]: Association for Computational Linguistics,
2018. p. 8–14.
156. BIBLIOGRAPHY IV
SCHIEBINGER, L. Scientific research must take gender into account.
Nature, Nature Publishing Group, v. 507, n. 7490, p. 9, 2014.
STANOVSKY, G.; SMITH, N. A.; ZETTLEMOYER, L. Evaluating
gender bias in machine translation. In: ACL (1). [S.l.]: Association for
Computational Linguistics, 2019. p. 1679–1684.
WEBSTER, K. et al. Mind the gap: A balanced corpus of gendered
ambiguous pronouns. In: Transactions of the ACL. [S.l.: s.n.], 2018.
p. to appear.
ZHAO, J. et al. Gender bias in coreference resolution: Evaluation
and debiasing methods. In: NAACL-HLT (2). [S.l.]: Association for
Computational Linguistics, 2018. p. 15–20.