Breakout 1. Research and Development, including Technical Performance.

30 October 2019
Ashley J. Llorens
Chief, Intelligent Systems Center
Johns Hopkins Applied Physics Laboratory
www.jhuapl.edu/isc
Technology Vectors for Intelligent Systems
HAI AI Index – Research and Development Breakout Session

A Systems View of Artificial Intelligence
• An intelligent system is an agent that
has the ability to perceive its
environment, decide upon a course of
action, act within a framework of
acceptable actions, and team with
humans and other agents to accomplish
a human-specified mission.
• Even when performing tasks
autonomously, an intelligent system is
always part of a human-machine team.
• To facilitate effective delegation of tasks
to an agent, humans must have
appropriately calibrated trust in the
agent’s capabilities.
• We see it as imperative that
advancements in AI and associated
metrics span these key attributes of
intelligent system capabilities: perceive,
decide, act, team, trust.
2Intelligent Systems CenterJohns Hopkins Applied Physics Laboratory
An AI-assisted handshake at JHU/APL’s Intelligent Systems Center

3
Envisioned Futures Enabled by Intelligent Systems
Intelligent Systems CenterJohns Hopkins Applied Physics Laboratory
• Over the past year, the Johns
Hopkins Applied Physics Laboratory
(JHU/APL) performed an analysis of
envisioned futures for national
security, space exploration and
human health that could potentially
be enabled by targeted
advancements in intelligent
systems.
• This effort has produced four
essential technology vectors to
guide the progress of artificial
intelligence in the coming decades
towards addressing critical national
global challenges.

4
Technology Vector 2: Superhuman decision-making and autonomous action:
Systems that identify, evaluate, select, and execute effective courses of action with
superhuman speed and accuracy for real-world challenges
Technology Vector 1: Autonomous perception:
Systems that reason about their environment, focus on the mission-critical aspects of the
scene, understand the intent of humans and other machines and learn through exploration
Technology Vector 3: Human-machine teaming at the speed of thought:
Systems that understand human intent and work in collaboration with humans to perform
tasks that are difficult or impossible for humans to carry out with speed and accuracy
Technology Vector 4: Safe and assured operation:
Systems that are robust to real-world perturbation and resilient to adversarial attacks with
ethical reasoning and goals that are guaranteed to remain aligned with human intent
Intelligent Systems CenterJohns Hopkins Applied Physics Laboratory

Machines select and perform appropriate behaviors
5Intelligent Systems CenterJohns Hopkins Applied Physics Laboratory
Collaborating to Advance Artificial Intelligence
• JHU/APL aims to accelerate progress
along these through our own research
and by engaging the broader
ecosystem.
• Challenge problems are an important
tool for sparking collaboration towards
key advancements.
• Reconnaissance Blind Chess was
crafted with this mind and is a
featured challenge at this year’s
Neural Information Processing
Systems (NeurIPS 2019).
• We see our Technology Vectors as
focal points for charting the landscape
of emerging developments in AI while
identifying gaps and informing future
investments and policy development.

Towards more meaningful
evaluations in AI
Christopher Potts
Stanford Linguistics
AI Index Workshop on Measurement in AI Policy
Special thanks to Atticus Geiger and
Robin Jia for helpful discussion!

Standard evaluations
1.  Create a dataset from a single process
2.  Divide the dataset into disjoint train and test sets, and set
the test set aside.
3.  Develop systems on the train set.
4.  Only after all system development is complete, evaluate
the systems based on accuracy on the test set.
5.  Report the results as providing an estimate of
the system’s capacity to generalize.

The Natural Language Inference (NLI) task
Premise Relation Hypothesis
1. turtle contradicts linguist
2. A turtle danced. entails A turtle moved.
4. Some turtles walk. neutral Some rabbits move.
5.
James Byron Dean refused to move
without blue jeans.
entails
James Dean didn’t dance without
pants.
6.
Mitsubishi Motors Corp’s new vehicle
sales in the US fell 46 percent in June.
contradicts Mitsubishi’s sales rose 46 percent.

Stanford Natural Language Inference Corpus (SNLI)

The best NLI systems fail on mildly adversarial tests
Train
A little girl kneeling in the dirt crying.
entails A little girl is very sad.
Adversarial entails A little girl is very unhappy.
Train
A woman is pulling a child on a sled
in the snow.
entails
A child is sitting on a sled in the
snow.
Adversarial
A child is pulling a woman on a sled
in the snow.
neutral

The Stanford Question Answering Dataset (SQuAD)
Passage
Peyton Manning became the first quarterback ever to lead two different
teams to multiple Super Bowls. He is also the oldest quarterback ever to play
in a Super Bowl at age 39. The past record was held by John Elway, who led
the Broncos to victory in Super Bowl XXXIII at age 38 and is currently
Denver’s Executive Vice President of Football Operations and General
Manager.
Question
What is the name of the quarterback who was 38 in Super Bowl XXXIII?
Answer
John Elway

Training example
Passage
Manager. Quarterback Jeff Dean had jersey number 37 in Champ Bowl
XXXIV.
Question
Answer
John Elway

Adversarial test example (Jia et al., EMNLP 2017)
Passage
Manager. Quarterback Jeff Dean had jersey number 37 in Champ Bowl
XXXIV.
Question
Answer
John Elway Model Prediction: Jeff Dean

Devastating effects
The average performance of 16
published models trained on
SQuAD drops from a 75% F1
score to a 36% F1 score
System performance is also
shuffled, suggesting a
certain brittleness.

Measuring human performance
1. turtle linguist
2. A turtle danced. A dog jumped.
4. A photo of a race horse. A photo of an athlete
5. A chef using a barbecue. A person using a machine.
6.
Mitsubishi Motors Corp’s new vehicle
sales in the US fell 46 percent in June.
Mitsubishi’s sales rose 46 percent.
contradicts
???
neutral
???
???
Our human tasks are machine tasks and therefore
understate human performance.

The Turing Test
A machine’s behavior is intelligent if it can trick a human
interrogator into thinking it is human using only conversation.

People are bad at the Turing Test!
Report from the first Turing Test (Schieber 1994)
Cynthia Clay, the Shakespeare aficionado, was thrice misclassified
as a computer. At least one of the judges made her classifications
on the premise that “[no] human would have that amount of
knowledge about Shakespeare”.
Turing Test event at the University of Reading
“A computer program called Eugene Goostman, which simulates a
13-year-old Ukrainian boy, is said to have passed the Turing test”

Somewhere between accuracy and Turing tests
Can a system perform more accurately on a friendly test set than a human
performing that same machine task? (Standard)
Can a system perform like a human in open-ended adversarial communication?
(Turing test)
Thanks!
Can a system behave systematically (even if it’s not accurate)?
Can a system assess its own confidence – know when not to make a prediction?
Can a system make people happier and more productive?

Dewey Murdick
Director of Data Science
dewey.murdick@georgetown.edu
https://cset.georgetown.edu
1
Highlighting ongoing work done in collaboration with Michael
Page, Daniel Chou, James Dunham, and Jennifer Melot

Connecting policymakers to high-quality analysis of emerging
technologies and their security implications (initial focus on AI)
#Nonpartisan #EmergingTech #AIpolicy
#security #analysis #ML #AI
2
PC: Georgetown University

Questions drive CSET… for example:
1. Will the big tech companies dominate the frontier of AI R&D in
3-5 years?
2. What impact does private-sector AI innovation have on a
country’s hard and soft power?
3. How much progress on AI in China is due to indigenous research
vs. legal and/or extralegal tech transfers?
4. What does collaboration with the defense-sector mean for a tech
company (within and outside of the HQ country)?
5. Will China develop an indigenous semiconductor industry
competitive with the US?
3

Questions drive CSET… for example:
1. Will the big tech companies dominate the frontier of AI R&D in
3-5 years?
2. What impact does private-sector AI innovation have on a
country’s hard and soft power?
3. How much progress on AI in China is due to indigenous research
vs. legal and/or extralegal tech transfers?
4. What does collaboration with the defense-sector mean for a tech
company (within and outside of the HQ country)?
5. Will China develop an indigenous semiconductor industry
competitive with the US?
4
Measures and metrics stay linked to questions
(we think metrics without context can lead to confusion in DC)

Example 1 - Will the big tech companies dominate
the frontier of AI R&D in 3-5 years?
● Research output over time
○ Industry vs. academia: % of top papers (by citation and venue)
○ Within industry: % of top industry papers (by citation and venue)
● Talent acquisition type and hiring rates over time
○ Absolute and relative number of job postings by corporation and AI-relevant skill sets
○ Fraction of top-tier AI talent within the industry (e.g., résumés and CVs)
● Investment, funding flows, and market share over time
○ Absolute and relative measures for research and development grants & contracts
○ Public and private company investments (e.g., M&A, private equity, etc.)
○ Number of innovative product releases (w/ AI-integration), market type and share, etc.
● Calibrated community-of-practice-based technical forecasts
○ Probability community will mature (e.g., workforce size, corporate involvement, investment)
○ Forecasted applications; community research level, technology readiness, horizon 1-3, etc.
Note: Developing cross-source indicators (e.g., fusion by organization)
5

● Hard Power - Candidate Measures and Metrics
○ Flow of private-sector talent to defense agencies or defense contractors
○ Level of AI RDT&E investment (e.g., funding, staff) by defense agencies
○ Number and fraction of top AI companies that take defense contracts
○ Number of defense systems (develop, deploy) that apply AI capabilities
● Soft Power - Candidate Measures and Metrics
○ Presence at top international ML conf. & international collaboration rates
○ Fraction of AI workforce trained within a given country
○ Net skilled talent flow in/out of country; fraction of foreign talent that
emigrate
○ Role in establishing international governance structures and norms for AI
Example 2 - What impact does private-sector AI
innovation have on a country’s hard & soft power?
6

Active Lines of Research (AI Focus)
7
AI Applications
& Implications
Competitiveness
State of Play
Forecasting
Talent
Investment
Hardware
Data, algorithms & models Alliances
AI safety
Weapons
Military power
Cyber operations

Fusion of foreign and domestic S&T data sources
3. Workforce / Talent
○ Job Postings (English, Chinese)
○ CVs and Resume Data
○ FOIA Visa / Immigration Data / Port
of Entry Data (English)
4. Analyst-directed data sources
○ Targeted Surveys
○ Human Annotated Data
○ Prioritized Translations
○ Intent / Policy Docs, etc.
And more...
8
1. Technical Text
○ Scholarly Literature (English,
Chinese, Russian)
○ Dissertations & Theses (Chinese,
English, etc.)
○ Tech News (Chinese, Worldwide)
○ Patents (Worldwide)
○ News wire (Worldwide)
2. Worldwide Funding
○ Grant Funding
○ Financial Transactions for Publicly
and Privately-held Corporations
○ Venture Financial Transactions
○ Spending by governments

Upcoming
What we’re doing
1. Releasing analytic reports
2. Launch fortnightly
e-newsletter
3. Acquiring and improving
relevant data sets
4. Establishing and calibrating a
forecasting capability
5. Next CSET Seminar, Remco
Zwetsloot, Nov 20 (in DC)
What you can do
1. Subscribe to our e-newsletter
at cset.georgetown.edu
2. Tell us how we can help --
what are your AI-related
questions, and what
knowledge gaps have you
seen?
3. Help develop new indicator
features & language models
4. Help develop good
measures and metrics that
answer key AI questions
9

“Moore’s Law” of
Academic Knowledge:
> 1 M titles/month

10
100
1000
10000
100000
1000000
1960
1962
1964
1966
1968
1970
1972
1974
1976
1978
1980
1982
1984
1986
1988
1990
1992
1994
1996
1998
2000
2002
2004
2006
2008
2010
2012
2014
2016
2018
WHAT DO AI WINTERS LOOK LIKE?
Deep Learning
Year
AI
Artificial Neural Network
PublicationCount

https://www.usnews.com/educ
ation/best-global-
universities/search?region=&subj
ect=computer-science&name=
(2019)

https://www.usnews.com/educ
ation/best-global-
universities/search?region=&subj
ect=computer-science&name=
(2020)

Anyone with knowledge of computer science research will see these rankings for
what they are – nonsense – and ignore them. But others may be seriously misled.

It is unreasonable to expect that departments half-way around the world will have
anything close to an accurate assessment of each other
…the methodology makes inferences from the wrong data without transparency
and, consequently, it arrives at an absurd ranking.

Inference & Reasoning
Machine Readers (NLP)
Reinforcement Learning
𝑐(𝑡)
𝑠(𝑡)
𝑠(𝑡 − 𝑇)
Knowledge graph
Search results &
Recommendation
Citation behavior
Existing ranking
AI Components in Microsoft Academic
Entities+ Relations

Past 10YearsAllTime Past 5Years

30 October 2019
Maria de Kleijn, Senior Vice President Analytical Services
Artificial Intelligence
Peer reviewed research – volume
and quality metrics

| 2
Experts agree there is no common definition on AI
“There is no
commonly
agreed
ontology for
AI”
“It’s just
statistics on
steroids”
“An umbrella term to
describe the capability to
make computers apply
judgment as a human
being would”
“Many people
say AI when
they actually
mean machine
learning”

| 3
AI corpus definition at article level

| 4
Data on peer-reviewed articles and conference
proceedings from Scopus
Article
70+ million
Journal, conference,
& Book records
Author
16+ million
Author profiles
(active)
Affiliation
70,000+
Affiliation profiles
Other sources used for
quantitative analysis
• Preprint servers (arXiv)
• PlumX dashboard
• Online competitions (Kaggle)
• ScienceDirect
• Graduate information (CAS,
China)
76
million
Items
16
million
Author profiles
~70,000
Affiliation
Profiles
1.4 billion cited
references
dating back to 1970

| 5
Globally, AI structures
into seven research
clusters
Search and
Optimization
Fuzzy
Systems
Planning and
Decision
Making
Natural Language
Processing and
Knowledge
Representation Computer Vision
Neural
Networks
Machine
Learning and
Probabilistic
Reasoning
Using AI to define and
structure AI
• Trained classifier to
distinguish AI papers
from non-AI papers
• Supervised learning
using keyword co-
occurrence to
structure the field

| 6
Source: Scopus
Research output per year, per cluster, globally

| 7
US: strong corporate sector
Key Contribution
(academic and corporate
institutions)
Number of publications
(all)
Field-Weighted Citation Impact

| 8
US: attracting overseas talent to its corporations
-318

| 9
AI research is found in computer
science, and in application areas
like medicine, energy,
biochemistry
Topic cluster “semantics;
models; recommender systems”

| 10
“Semantics; models; recommender systems” in itself
has interdisciplinary components

| 11
Conclusion
• Tracking research in AI poses particular challenges, that can be
overcome with machine learning – “using AI to define AI”
• It takes a well structured database – linking articles to authors and
institutions – to get insights beyond simple volume metrics
• Scientometrics can help answer key policy questions like brain
drain/gain and the role of corporates
• AI research moving from ‘core’ computer science to application fields
is visible in the data
• Insights go beyond metrics!

| 12
Available resources
AI Resource Center:
https://www.elsevier.com/connect/ai-
resource-center
Download AI Report:
https://www.elsevier.com/research-
intelligence/ai-report

| 15
w
How is AI
being taught?
How is AI
researched?
How is AI being
talked about in
media?
How is AI being
described in
patents?
Achieving policy objectives requires actions across sectors

| 17
Keywords shared
across all 4
perspectives:
• Artificial Intelligence
• Deep Learning
• Machine Learning
• Neural Network
• Reinforcement Learning
• Speech Recognition`
AI seems to lack a common language
Teaching
268
Media
82
Research
42
Industry
641

| 18
US: ability to also retain strong researchers
Migratory
Outflow
Migratory Inflow
Transitory
Sedentary
0.00
0.50
1.00
1.50
2.00
2.50
3.00
3.50
0.00 0.20 0.40 0.60 0.80 1.00 1.20 1.40 1.60 1.80 2.00
Relative productivity
Relative impact versus relative productivity

Kostas Stathoulopoulos
arXlive
Real-time monitoring of research activity in arXiv

Motivation
How do we break down “AI”
and search for its subfields?
How do we discover novel
research as it happens?
How do we enable
policymakers search for
specialised topics without the
help of experts?

Motivation What is arXlive?
How do we break down “AI”
and search for its subfields?
How do we discover novel
research as it happens?
How do we enable
policymakers search for
specialised topics without the
help of experts?
A scalable, shareable and
flexible open source platform
for real-time monitoring of
research activity in arXiv
preprints.

arXlive: a data analysis and production system
Abstract
Authors

Abstract
GeographyAffiliationsAuthors

Abstract
Search
Novelty
model
HierarXy

Abstract
Search
Novelty
model
HierarXy
Query
expansion
Keyword
Factory

Abstract
Search
Novelty
model
HierarXy
Query
expansion
Keyword
Factory
Topic
model
Deep
learning
papers
Deep Learning,
Deep Change

Search for arXiv papers using a
query expansion approach.
Filter results by publication date,
citations, geography, discipline,
arXiv category and novelty.
Novelty: How dissimilar is a paper
from its most similar publications?
HierarXy

Keyword Factory
What else should I be searching
for? Generate lists of relevant
keywords based on arXiv data,
without prior knowledge.

Real-time update of papers
Daily updates of our Deep learning,
deep change? Mapping the development of
the Artificial Intelligence General Purpose
Technology paper.
Why?
● Reduce overheads; Policy
makers can find the most
up-to-date results on arXlive.
● Robustness by design.
● Log unexpected changes as
they happen.

Collect the full text of each publication.
● Track funding in AI research by parsing the paper
acknowledgements.
● Identify the inputs and outputs of AI systems.
● Develop a semantic search engine to enable long text queries.
Real-time updates of our Gender Diversity in AI Research paper.
Incorporate additional altmetrics for arXiv papers.
Visual exploration of the search space.
Next steps

nesta.org.uk
@nesta_uk
Website: https://arxlive.org/
GitHub: https://github.com/nestauk/clio-lite
Thank you!
@kstathou

Natural Language Understanding and Inference:
Benchmarks, Resources, and Approaches
Shane Storks (University of Michigan)
Qiaozi Gao (Michigan State University)
Joyce Y. Chai (University of Michigan)

Understanding Natural Language
● Benchmarks that require deep language understanding that goes beyond
what’s explicitly written, and rely on inference and knowledge of the world.
● Knowledge
○ linguistic knowledge (e.g., Penn Treebank, WordNet)
○ common knowledge (e.g., Freebase, DBpedia, YAGO)
○ commonsense knowledge (e.g., ConceptNet, ATOMIC)
"Jack needed some money, so he went and shook his piggy bank.
He was disappointed when it made no sound."
- Why was Jack disappointed? (Minsky, 2000)

Benchmarks
● Coreference Resolution
○ e.g., Winograd Schema Challenge
● Question Answering
○ e.g., SQuAD, OpenBookQA
● Textual Entailment
○ e.g., RTE, SNLI
● Plausible Inference
○ e.g., COPA, ROCStories
● Multiple Tasks
○ e.g., GLUE, DNC

Benchmarks
○ e.g., RTE, SNLI
● Multiple Tasks
○ e.g., GLUE, DNC
- The trophy would not ﬁt in the brown
suitcase because it was too big.
- What was too big?
A. The trophy
B. The suitcase

Benchmarks
○ e.g., RTE, SNLI
● Multiple Tasks
○ e.g., GLUE, DNC
- The trophy would not ﬁt in the brown
suitcase because it was too small.
- What was too small?
A. The trophy
B. The suitcase

Benchmarks
○ e.g., RTE, SNLI
● Multiple Tasks
○ e.g., GLUE, DNC
- Which of these would let the most
heat travel through?
A. a new pair of jeans.
B. a steel spoon in a cafeteria.
C. a cotton candy at a store.
D. a calvin klein cotton hat.
Evidence: Metal is a thermal conductor.

Benchmarks
○ e.g., RTE, SNLI
● Multiple Tasks
○ e.g., GLUE, DNC
- Text: A black race car starts up in front
of a crowd of people.
- Hypothesis: A man is driving down a
lonely road.
- Label: contradiction

Benchmarks
○ e.g., RTE, SNLI
● Multiple Tasks
○ e.g., GLUE, DNC
I knocked on my neighbor’s door.
What happened as result?
A. My neighbor invited me in.
B. My neighbor left his house.

Creating Benchmarks: Criteria and Considerations
● Task Format
○ Classification tasks
○ Open-ended tasks
● Evaluation Scheme
○ Evaluation metrics: objective and easy to calculate
○ Human performance measurement
● Avoiding Data Biases
○ Label distribution bias
○ Question Type Bias in QA
○ Superficial Correlation Bias (gender bias, human stylistic artifacts)

Approaches: General Architecture
● Symbolic approaches
● Statistical approaches
● Latest SOTA use deep neural
network (e.g., transformer) with
built-in pre-trained contextual
embeddings
○ Performance keeps increasing
○ Exceeding human performance
sometimes

Performance Trends
● Many factors may affect progress
on benchmarks
○ Actual task difficulty
○ Data size
○ Year released
○ Number of people working on the
benchmark
○ Data bias
● Performance should be interpreted
with caution

Future Questions
● Doe the benchmark performance really reflect the machine inference
abilities?
● How to explain model behaviors so that humans can understand the
underlying inference process?
● How can we make better use of available knowledge resources?
● How can we train energy/cost efficient models?
○ How the Transformers broke NLP leaderboards - Rogers, 2019
○ Green AI - Schwartz et al., 2019

Creating Benchmarks: Data Biases
● Label Distribution Bias
○ relatively easy to avoid: an equal number of examples for each class
● Question Type Bias in QA
○ distribution of the first words of questions (e.g., CoQA, CommonsenseQA)
○ manually analysis of question categories (e.g., Squad 2.0, ARC)
○ predefined question types (e.g., ProPara)
● Superficial Correlation Bias
○ e.g., gender bias, human stylistic artifacts
○ relatively difficult to avoid
○ adversarial filtering process (e.g., SWAG)

Benchmarks
● Turing Test
○ encouraging machines to deceive humans
○ no feedback on a continuous scale to allow for incremental development
● Early NLP Benchmarks
○ Part-of-speech Tagging
○ Named Entity Recognition
○ Coreference Resolution
○ Information Extraction Jyc: delete this slide

Thank you!
Jyc: at least show two or three slides about approaches:
- One slide on the general architecture
- One slide on example performance? Shane is making a
figure for that, discuss the differences between human
performance and model performance.
Also need a slide to summarize:
- What pending questions from the exercise on
benchmarks.
- What should be some ideas for future direction.

Knowledge Base
Humans perform inference based on vast amount of knowledge about how the
world works. To support machines’ inference ability, a parallel ongoing research
effort in the last several decades is the development of various knowledge
resources.

Knowledge Base Collection
Discuss issues related to collecting knowledge required to perform commonsense
reasoning

Learning and Inference Approaches
● Symbolic Approaches
● Statistical Approaches
● Neural Approaches

Model Generalization
Consequence of previous issue?
Talk about current SOTA models and probing studies (like Niven and Kao, 2019)

PROGRESS IN
COMMERCIAL MACHINE
TRANSLATION SYSTEMS
by Konstantin Savenkov,

Ph.D., CEO Intento

October 29-30, 2019
Stanford University, Human-Centered Artiﬁcial Intelligence (HAI) and AI Index
Workshop on Measurement in AI Policy: Opportunities and Challenges

Intento
Alibaba Amazon Baidu
Cloud
Translate
DeepL eBay
Globalese Google GTCom IBM Iconic Kakao
KantanMT Microsoft Mirai ModernMT Naver Niutrans
Omniscien
Pangea
MT
PROMT PrompsIT Rozetta SAP
SDL Sogou Systran Tencent Tilde Yandex
Youdao
COMMERCIAL MT SYSTEMS
2
All product names, trademarks and registered trademarks are property of their respective owners. All company, product and service names used in this website are for
identiﬁcation purposes only. Use of these names, trademarks and brands does not imply endorsement.
© Intento, Inc. / October 2019

Intento
VENDOR DYNAMICS (STOCK MODELS)
3
Commercial
Alibaba, Amazon,
Baidu, CloudTranslate,
DeepL, Google,
GTCom, IBM, Mirai,
Microsoft, ModernMT,
Naver, Niutrans,
PROMT, Rozetta, SAP,
SDL, Sogou, Systran,
Tilde, Tencent, Yandex,
Youdao
Preview / Limited
eBay, Kakao, QCRI
0
5
10
15
20
25
Mar 18 Jul 18 Dec 18 Jun 19 Nov 19
Preview
Commercial
Intento, Inc. • June 2019

Intento
SUPPORTED LANGUAGE PAIRS
4
1
100
10000
N
iutrans
G
oogle
Yandex
M
icrosoftv3
Sogou
Baidu
Am
azon
Kakao
Systran
Tencent
SDL
PRO
M
T
G
TC
om
SAP
DeepL
M
odernM
TIBM
W
atson
v3
N
aver
Youdao
Alibaba
eBay
Tilde
1
3
2
54
6
8
272
2 202
1
2
20
24
38
5256
72
9090
111121122139
342
594
756
3 4223 782
7 482
10 50613 340
Total
Unique
* where possible, we have checked via API if all language pairs advertised by the
documentation are supported and removed the pairs we were unable to locate in the API.
** as advertised (not validated via API)
Unique
language pairs
- supported
exclusively by
one provider

Intento
MT QUALITY EVALUATION
5
Intento monitors MT Quality since May 2017 (public report
every 4-6 months).
—
48 popular language pairs, based on WMT and other public
news corpora.
—
Reference-based evaluation using hLEPOR score (n=2000,
statistically signiﬁcant)

Intento
BEST MT
ENGINES
(AS OF
JUNE 2019)
6
en ru ja de es fr pt it zh cs tr fi ro ko ar nl
en
ru
ja
de
es
fr
pt
it
zh
cs
tr
fi
ro
ko
ar
nl
MT Engines
deepl
google
amazon
yandex
systran-pnmt
modernmt
ibm
promt
microsoft
tencent
baidu
6
In several cases, there’s no
statistically significant difference
between the top engines.
changed since
Jan 2019:
19 pairs

Intento
MORE INVESTMENT IN MT QUALITY GOES
INTO POPULAR LANGUAGE PAIRS
7
Intento
data curation
—
new architectures
—
direct translation

Intento
MT PROGRESS BEYOND LOW-IMPACT CONTENT
REQUIRES MORE THAN GENERIC MODELS
8
Intento
Cross-language
NLP
High-volume low
impact
Low-impact
(inbound etc)
High-impact
generic
High-impact in-
domain
MACHINES
HUMANS

Intento
MT PROGRESS BEYOND LOW-IMPACT CONTENT
REQUIRES MORE THAN GENERIC MODELS
9
Intento
Cross-language
NLP
High-volume low
impact
Low-impact
(inbound etc)
High-impact
generic
High-impact in-
domain
MACHINES
HUMANS
HOT TOPIC

Intento
2018: RAISE OF DOMAIN-ADAPTIVE NMT
10
Intento
Sep
2017
Oct
2018
Nov
2017
May
2018
Jun
2018
Jul
2018
Globalese
Custom
NMT
Lilt
Adaptive
NMT
IBM
Custom
NMT
Microsoft
Custom
Translator
Google
AutoML
Translation
SDL
ETS 8.0
ModernMT
Enterprise
Apr
2018
Systran
PNMT

Intento
2019: CUSTOM TERMINOLOGY SUPPORT
11
Intento
Jun
2018
Oct
2019
Oct
2018
Jan
2019
Apr
2019
Amazon
Translate
Google
Translate
v3
SDL
BeGlobal
4.1
Microsoft
Custom
Translator
Nov
2018
Systran
PNMT
IBM
Custom
NMT
“forced glossary customisation”
“phrase dictionaries”
“custom terminology”
“syntax-aware
custom terminology”
May
2019
Yandex
Cloud
Translate v2
dynamic glossaries
“glossaries”
“glossary feature”

Intento
IMPROVEMENT BEYOND STOCK MODELS
12
Intento
Stock models deﬁne starting
points
—

Adaptation based on
Translation Memory and
Terminology drives further
improvement
—

Depends on architecture, data
volume and quality

Intento
GENERIC STOCK MODELS
Alibaba Amazon Baidu DeepL eBay Google
GTCom IBM Kakao Microsoft Mirai ModernMT
Niutrans Naver Omniscien PROMT Rozetta SAP
SDL Sogou Systran Tencent Tilde Yandex
DOMAIN ADAPTATION CAPABILITIES
13© Intento, Inc. / October 2019
VERTICAL STOCK MODELS
CUSTOM TERMINOLOGY SUPPORT
AUTO DOMAIN ADAPTATION MANUAL DOMAIN ADAPTATION
Youdao
Alibaba Baidu
Cloud
Translate
Microsoft Omniscien PROMT
SAP Systran
Amazon Baidu Google IBM Microsoft Rozetta SDL Systran Yandex
Globalese Google IBM
Kantan Microsoft ModernMT
Omniscien SDL Systran
Alibaba Baidu
Cloud
Translate
Iconic
Omniscien PangeaMT Prompsit PROMT
SDL Systran Tilde Yandex
All product names, trademarks and registered trademarks are property of their respective owners. All company, product and service names used in this website are for
identiﬁcation purposes only. Use of these names, trademarks and brands does not imply endorsement.

Intento
DATA COLLECTION PRACTICES NEED TO
MATCH GROWING MT UBIQUITY
14
Growing MT quality makes it
ubiquitous
—

Enterprise adoption is far behind
user adoption
—

Data collection policy remains in
the fine print of “free” MT services
—

That’s more important than
collecting cookies (we think)
“We recently found that ~2Gb of
confidential data goes from our
network to (free MT service)”
company Y (2019)
“We tried to block traffic to (free MT
service), but SVP said it will stop
the entire company’s operations”
company X (2018)
“We discovered text that had been
typed in on (MT service) could be
found by anyone conducting a web
search.”
Statoil (Sept 2017, link)

THANKS!
by Konstantin Savenkov,

Ph.D., CEO Intento

October 29-30, 2019
Stanford University, Human-Centered Artiﬁcial Intelligence (HAI) and AI Index
Workshop on Measurement in AI Policy: Opportunities and Challenges

THANK YOU!
Konstantin Savenkov

ks@inten.to

2150 Shattuck Ave

Berkeley CA 94704
INTENTO
https://inten.to
16

Quantifying Algorithmic
Improvements over Time
Lars Kotthoff
University of Wyoming
larsko@uwyo.edu1
Measurement in AI Policy Workshop, 30 October 2019
1
Based on Kotthoff, Lars, Alexandre Fréchette, Tomasz P. Michalak, Talal
Rahwan, Holger H. Hoos, and Kevin Leyton-Brown. “Quantifying Algorithmic
Improvements over Time.” In 27th International Joint Conference on Artificial
Intelligence (IJCAI) Special Track on the Evolution of the Contours of AI, 2018.

Key Ideas
▷ science is not a horse race
▷ reward new ideas and complementary approaches
▷ stand on the shoulders of giants, and give credit to those
giants
1

Contributions – Standalone Performance
798602199
798501630
798470169
798466233
798461169
798360514
794178118
784476788
671833
dual pivot (2009)
median 9 (1993)
median 9 random (1993)
mid (1978)
random (1961)
median 3 (1978)
first (1961)
insertion (1946)
dual pivot (2009)
median 9 (1993)
mid (1978)
random (1961)
median 3 (1978)
first (1961)
insertion (1946)
Standalone Performance 2

Contributions – Marginal Performance
798602199
798501630
798470169
798466233
798461169
798360514
794178118
784476788
671833
98900
18
5
5
3
1
0
0
0
dual pivot (2009)
median 9 (1993)
mid (1978)
random (1961)
median 3 (1978)
first (1961)
insertion (1946)
dual pivot (2009)
median 9 (1993)
mid (1978)
median 3 (1978)
first (1961)
insertion (1946)
random (1961)
Standalone Performance Marginal Performance
3

Contributions – Shapley Value
798602199
798501630
798470169
798466233
798461169
798360514
794178118
784476788
671833
100267058
100167412
100153715
100153384
100151097
100131186
99434662
98059604
84173
98900
18
5
5
3
1
0
0
0
dual pivot (2009)
median 9 (1993)
mid (1978)
random (1961)
median 3 (1978)
first (1961)
insertion (1946)
dual pivot (2009)
median 9 (1993)
mid (1978)
median 3 (1978)
first (1961)
insertion (1946)
random (1961)
Standalone Performance Shapley Value Marginal Performance 4

Contributions – Temporal Shapley Value
798602199
798501630
798470169
798466233
798461169
798360514
794178118
784476788
671833
100267058
100167412
100153715
100153384
100151097
100131186
99434662
98059604
84173
405450356
392238462
671833
98900
57198
50411
22506
10074
2550
13212030
671833
98900
20497
15703
6907
552
541
137
dual pivot (2009)
median 9 (1993)
mid (1978)
random (1961)
median 3 (1978)
first (1961)
insertion (1946)
random (1961)
insertion (1946)
dual pivot (2009)
median 9 (1993)
mid (1978)
median 3 (1978)
first (1961)
Standalone Performance
Shapley Value
Temporal Shapley Value
Temporal Marginal Performance
5

Quicksort Over Time
1e+02
1e+05
1e+08
1946 1961 1978 1993 2009
year
SumoftemporalShapleyvalues
6

SAT Competition
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
144.25268
144.08601
139.91915
57.41773
56.33454
51.75097
44.43418
43.43408
35.08748
28.8338
27.08395
19.16712
19.13952
16.3337
12.43352
10.19402
9.69765
9.0321
8.30224
8.0001
7.75017
7.03675
6.64388
4.87001
4.50018
4.26672
4.26672
3.91678
2.00002
2.00001
1.83337
1.16668
0.83336
4e−05
0
0
0
0
78.42398
63.90765
55.75744
55.15192
52.43065
50.72959
45.33413
44.61692
44.50427
36.26344
31.60638
30.53198
30.41135
28.4814
25.76449
21.82523
20.7125
20.49854
20.15654
19.71357
18.71084
17.82205
16.86642
16.31361
15.36641
14.86641
14.18306
13.18303
9.83857
8.74145
4.81501
3.56499
1.90815
1.88997
0.53389
0.45055
0.14286
0
gnovelty+_2007
ranov_2007
adaptnovelty_2007
TNM_2009
sparrow2011_2011
sapsrt_2007
March−KS_2007
KCNFS_2007
dimetheus_2.100_2014
hybridGM3_2009
iPAWS_2009
sattime2011_2011
BalancedZ_2014
adaptg2wsat2011_2011
DEWSATZ−1A_2007
CSCCSat2014_SC2014_2014
CCgscore_2014
probSAT_sc14_2014
Ncca+_v1.05_2014
CSHCrandMC_2013
gnovelty+2_2009
YalSAT_03l_2014
CCA2014_2.0_2014
sattime_2014
MPhaseSAT_M−2011−02−16_2011
minisat−SAT_2007
MXC_2007
gNovelty+−T_2009
csls−pnorm−8cores_2011
march_br_sat+unsat_2013
march_rw−2011−03−02_2011
MiraXT−v3_2007
march_hi_2011
gNovelty+GCwa_1.0_2013
minipure_1.0.1_2013
Solver43a_a_2013
Solver43b_b_2013
strangenight_satcomp11−st_2013
dimetheus_2.100_2014
BalancedZ_2014
CCgscore_2014
CSCCSat2014_SC2014_2014
probSAT_sc14_2014
Ncca+_v1.05_2014
CCA2014_2.0_2014
sattime_2014
YalSAT_03l_2014
sparrow2011_2011
TNM_2009
sattime2011_2011
adaptg2wsat2011_2011
MPhaseSAT_M−2011−02−16_2011
CSHCrandMC_2013
ranov_2007
iPAWS_2009
gnovelty+_2007
adaptnovelty_2007
hybridGM3_2009
gnovelty+2_2009
gNovelty+−T_2009
march_br_sat+unsat_2013
gNovelty+GCwa_1.0_2013
march_rw−2011−03−02_2011
march_hi_2011
March−KS_2007
KCNFS_2007
csls−pnorm−8cores_2011
sapsrt_2007
DEWSATZ−1A_2007
minipure_1.0.1_2013
MXC_2007
minisat−SAT_2007
MiraXT−v3_2007
Solver43a_a_2013
Solver43b_b_2013
strangenight_satcomp11−st_2013
Temporal Shapley Value Shapley Value
7

SAT Competition Over Time
0
200
400
600
2007 2009 2011 2013 2014
year
SumoftemporalShapleyValues
8

MiniZinc (CP) Competition
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
9987.86155
360.7717
1261.4996
29828.71058
4376.2985
922.81764
49647.2861
0
170.56476
21435.6316
5163.9305
197.036
16878.17865
12376.60633
193.93191
4775.12936
17831.07372
14097.37673
8886.77925
91361.24195
150.02742
20393.03799
192.78811
3267.92072
12513.1466
9828.08541
32189.36215
34970.99061
7059.47499
15552.43907
6190.81406
7087.25379
6756.47091
13073.56821
14389.56267
5262.55503
15095.99322
16977.5178
36090.70992
6071.40394
18324.39121
15746.33068
15810.6008
52.19571
0
6853.37461
11324.41791
Choco_2014
Choco_2016
Choco3_2015
Chuffed_2015
Chuffed_2016
Concrete_2016
G12Chuffed_2014
G12FD_2015
G12FD_2016
Gecode_2014
Gecode_2015
Gecode_2016
JaCoP_2014
JaCoP_2015
JaCoP_2016
LCG−Glucose_2016
OpturionCPX_2014
OpturionCPX_2015
OR−Tools_2015
ORTools_2014
PicatCP_2014
PicatCP_2016
SICStus_2014
SICStus_2016
Choco_2014
Choco_2016
Choco3_2015
Chuffed_2015
Chuffed_2016
Concrete_2016
G12Chuffed_2014
G12FD_2015
G12FD_2016
Gecode_2014
Gecode_2015
Gecode_2016
JaCoP_2014
JaCoP_2015
JaCoP_2016
LCG−Glucose_2016
OpturionCPX_2014
OpturionCPX_2015
OR−Tools_2015
ORTools_2014
PicatCP_2014
PicatCP_2016
SICStus_2014
SICStus_2016
Temporal Shapley Value Shapley Value
9

MiniZinc Competition Over Time
0
50000
100000
150000
200000
2014 2015 2016
year
SumoftemporalShapleyValues
10

Summary
▷ standalone performance does not indicate how algorithms
complement each other
▷ marginal performance is not fair
▷ Shapley Value
▷ provides better characterization of algorithms’ performance
▷ rewards algorithms that introduce novel and complementary
concepts
▷ enables better analysis of algorithms’ performance
▷ Temporal Shapley Value
▷ takes when an algorithm was conceived into account
▷ all desirable properties of Shapley Value
▷ rewards earlier algorithms, which may have inspired later
algorithms
11

Contributions – Temporal Marginal Performance
798602199
798501630
798470169
798466233
798461169
798360514
794178118
784476788
671833
13212030
671833
98900
20497
15703
6907
552
541
137
dual pivot (2009)
median 9 (1993)
mid (1978)
random (1961)
median 3 (1978)
first (1961)
insertion (1946)
random (1961)
insertion (1946)
dual pivot (2009)
median 9 (1993)
mid (1978)
median 3 (1978)
first (1961)
Standalone Performance Temporal Marginal Performance 12

GENDER BIAS IN MACHINE
TRANSLATION & QUANTIFYING
ETHICS IN AI RESEARCH
Marcelo O.R. Prates morprates@inf.ufrgs.br
Pedro H.C. Avelar phcavelar@inf.ufrgs.br
Luis C. Lamb lamb@inf.ufrgs.br
October 2019
1

ASSESSING GENDER BIAS IN MACHINE
TRANSLATION
This presentation is based on our work “Assessing Gender Bias in Machine
Translation – A Case Study with Google Translate”, (PRATES; AVELAR;
LAMB, 2019) and includes a short description of our work on Quantifying
the Role of Ethics in AI Research (PRATES; AVELAR; LAMB, 2018).
2

ETHICS IN AI RESEARCHMACHINE BIAS
• Machine Bias is a topic of great interest in academia and industry.
• Biases have been identiﬁed in several systems (ANGWIN et al., 2016;
BOLUKBASI et al., 2016; CHO et al., 2019; GARCIA, 2016; MILLS,
2017; PAPENFUSS, 2017; WEBSTER et al., 2018; ZHAO et al.,
2018).
• “Including gender analysis in research can save us from
life-threatening errors.” (SCHIEBINGER, 2014)
• Thus, solving bias in AI systems is important to achieve a fairer
society.
3

ETHICS IN AI RESEARCHBIAS IN WORD EMBEDDINGS
(BOLUKBASI et al., 2016) identiﬁed Biases in Word Embeddings and
argued that debiasing was necessary before applying these methods in real
world applications.
There have been hundreds of papers written about word em-
beddings and their applications (...). However, none of these
papers have recognized how blatantly sexist the embeddings are
and hence risk introducing biases of various types into real-world
systems.
(...)
One perspective on bias in word embeddings is that it merely
reﬂects bias in society, and therefore one should attempt to debias
society rather than word embeddings. However, by reducing the
bias in today’s computer systems (or at least not amplifying the
bias), which is increasingly reliant on word embeddings, in a
small way debiased word embeddings can hopefully contribute
to reducing gender bias in society. At the very least, machine
learning should not be used to inadvertently amplify these biases,
as we have seen can naturally happen. 4

ETHICS IN AI RESEARCHGENDER BIAS IN MACHINE TRANSLATION
Figure: Example translations which were trending in social media.
5

ETHICS IN AI RESEARCHMAIN IDEAS
• There was great social media interest on solving MT gender bias for
professions, in particular in the translation from gender neutral
languages.
• We this issue, by providing a transparent way of assessing gender bias
in MT systems.
• We provide a case study with a widely used system and compare it
with real world gender distributions.
• Extra: We provide a similar study for adjectives.
6

ETHICS IN AI RESEARCHDATA – LANGUAGES
• Languages
• With Gender Neutral Pronouns and supported by GT:
• Armenian, Basque, Bengali, Chinese – Mandarin (pinyin),
Estonian, Finnish, Hungarian, Japanese, Malay, Swahili,
Turkish, Yoruba.
• We did not include some GN Languages (Nepali, Korean and
Persian) due to diﬃculties in providing template/processing the
data.
7

ETHICS IN AI RESEARCHDATA – OCCUPATIONS
• Labour Data
• Extracted from the U.S. Bureau of Labor Statistics (Bureau of
Labor Statistics, 2017)
• Manually curated.
• Most occupations had data on gender distribution.
• Missing data imputed as category aggregate. For example:
- The profession “Sociologists” doesn’t have enough data to
contain a percentage of female participation.
- Its % is imputed as the aggregate in its category “Life,
physical, and social science occupations”.
- Two thousand employed (sociologists), with 47.4% women
(from Life, physical, and social science occupations).
8

ETHICS IN AI RESEARCHDATA – ADJECTIVES
• Adjectives
• Extracted from CoCA <https://corpus.byu.edu/coca/>
• Manually curated from the top 1,000 most frequent adjectives.
9

ETHICS IN AI RESEARCHRESULTS – OCCUPATION CATEGORY
Category
Healthcare
Production
Education
Farming
Fishing
Forestry
Service
Construction
Extraction
Corporate
Arts
Entertainment
STEM
Legal
Neutral
Female
Male
Gender
0
50
100
%
Figure: Plot showing how diﬀerent Occupation Categories have diﬀerent
distributions of translation pronouns.
10

ETHICS IN AI RESEARCHRESULTS – LANGUAGE
Language
Basque
Bengali
Yoruba
Chinese
Finnish
Hungarian
Turkish
Japanese
Estonian
Swahili
Armenian
Malay
Neutral
Female
Male
Gender
0
50
100
%
Figure: Plot showing how diﬀerent Languages have diﬀerent distributions of
translation pronouns.
11

ETHICS IN AI RESEARCHRESULTS – GT VS REAL DISTRIBUTION
12-quantile
1 2 3 4 5 6 7 8 9 10 11 12
0
10
20
30
40
Frequency(%)
Google Translate Female %
BLS Female Participation %
Data
Figure: Plot showing severe underestimation of female participation.
12

ETHICS IN AI RESEARCHRESULTS – ADJECTIVES
Adjective
Happy
Shy
Desirable
Sad
Dumb
Mature
Smart
Polite
Sympathetic
Loving
Modest
Wrong
Afraid
Innocent
Strong
Successful
Right
Brave
Cruel
Guilty
Proud
Neutral
Female
Male
Gender
0
50
100
%
Figure: Most adjectives seem to adopt male defaults, but some speciﬁc words
show certain trends, as “Guilty”, while some adjectives such as “shy” and
“happy” seem to skew less towards male translations. 13

ETHICS IN AI RESEARCHRESULTS – IMPROVEMENTS IN GT
Figure: GT provided translation alternatives shortly after our paper.
14

ETHICS IN AI RESEARCHLIMITATIONS
• None of us were speakers of the gender-neutral languages.
• None of us identiﬁed themselves as female.
• GT doesn’t provide conﬁdence scores for words in the API .
• Our work was limited to a single template translation per word
(except for Bengali).
• The occupation list is from a single source (BLS).
• Occupations were forward translated to be back-translated again.
15

ETHICS IN AI RESEARCHPOLICY SUGGESTIONS
• MT tools could provide alternative translations (GT has been updated
to include this).
• MT tools could provide conﬁdence scores for individual words.
• Automatic evaluation can help detect bias in a system and call for
further action.
• Datasets could have a curated subset to enforce parity.
16

ETHICS IN AI RESEARCHETHICS IN AI CONFERENCES
• Related work: quantifying the role of ethics in AI research (PRATES;
AVELAR; LAMB, 2018)
• Searched for ethics related keywords in ﬂagship conference abstracts
and titles.
• Although ethics is being more and more commonly discussed in
workshops, it is not typically discussed in the main ﬂagship conference
tracks.
17

Conferences
AAAI IJCAI NIPS ICML ICRA IROS
7, 179 7, 723 6, 509 3, 568 19, 368 15, 005
Journals
ACM
Trans.
Comm.
ACM
IEEE.
Com-
puter
JAIR IEEE
Trans.
AI
Artif.
Intell.
18, 199 11, 394 6, 694 972 10, 731 2, 766
Table: Sample sizes in number of papers for the analysed venues.
18

0
0.002
0.004
0.006
0.008
0.01
0.012
1965 1970 1975 1980 1985 1990 1995 2000 2005 2010 2015 2020
Averagenºofmatches
Year
AAAI
IJCAI
NIPS
ICML
ICRA
IROS
Figure: Frequency of the selected ethics-related keywords in each ﬁve year
interval in paper titles
19

ETHICS IN AI RESEARCHRELATED WORK
• Related work:
• (CHO et al., 2019) performed a similar evaluation for Korean on
three diﬀerent translation tools, using multiple sentence
templates.
• (STANOVSKY; SMITH; ZETTLEMOYER, 2019) evaluated
gender bias for 8 languages and 6 MT systems for correct
translation alignments.
• (KUCZMARSKI; JOHNSON, 2018) proposed techniques to
produce both translations in all genders in the target language.
• (ZHAO et al., 2018; RUDINGER et al., 2018; WEBSTER et al.,
2018) provided corpora for pronoun resolution and assessing
gender bias.
20

ETHICS IN AI RESEARCHCHO ET AL.
• Korean speakers
• Provided a way to test MT systems for the Korean language.
• Tested on 3 diﬀerent MT systems.
• Used multiple sentence templates per pair
21

ETHICS IN AI RESEARCHCHO ET AL.
Figure: Cho et al. tested on diﬀerent systems, including GT and Naver Papago
(NP). Reuse of this image was kindly permitted by Cho et al.
22

ETHICS IN AI RESEARCHSTANOVSKY, SMITH, ZETTLEMOYER
• (STANOVSKY; SMITH; ZETTLEMOYER, 2019) based their studies
in previous studies regarding Gender bias in coreference resolution
(ZHAO et al., 2018; RUDINGER et al., 2018).
• Tested on 6 diﬀerent MT systems, 4 commercial ones.
• Tested sentences based on automatic tools and checking for gender
alignment between the source and target sentences.
• Also performed manual annotation for a small subset of 100 sentences
with 2 native annotators.
23

ETHICS IN AI RESEARCHKUCZMARSKI, JOHNSON
• Proposed techniques to produce both translations in all genders in the
target language.
• In Summary:
• Identify if a translation query may need gendered translation.
• If so, translate the sentence forcing all possible genders in the
target language.
• Post-process to see if produced sentences are appropriate.
• Present gendered tuple to user if so, otherwise translate as
normal.
• Similar to what GT seems to have adopted.
24

ETHICS IN AI RESEARCHGENDER BIAS CORPORA
• (ZHAO et al., 2018; RUDINGER et al., 2018; WEBSTER et al.,
2018) provided corpora for gendered pronoun resolution.
• Can be used to benchmark MT tools.
• Also identiﬁed and called biases to attention.
25

ETHICS IN AI RESEARCHFUTURE WORK
• Future Work:
• We are not aware of a study similar to (CHO et al., 2019) for
the Persian or Nepali languages.
• Cho et al. are looking to expand their work to multiple
languages.
• We are expanding some of our experiments on bias in MT.
• We are - very - open to collaboration and suggestions.
26

THANK YOU
THANK YOU.
Thank You!
Contacts:
morprates@inf.ufrgs.br
phcavelar@inf.ufrgs.br
lamb@inf.ufrgs.br
27

THANK YOU
BIBLIOGRAPHY I
ANGWIN, J. et al. Machine bias: There’s software
used across the country to predict future criminals and
it’s biased against blacks. 2016. Last visited 2017-12-
17. Disponível em: <https://www.propublica.org/article/
machine-bias-risk-assessments-in-criminal-sentencing>.
BOLUKBASI, T. et al. Man is to computer programmer as woman
is to homemaker? debiasing word embeddings. In: NIPS. [S.l.: s.n.],
2016. p. 4349–4357.
Bureau of Labor Statistics. "Table 11: Employed persons by
detailed occupation, sex, race, and Hispanic or Latino ethnicity,
2017". [S.l.], 2017.
28

THANK YOU
BIBLIOGRAPHY II
CHO, W. I. et al. On measuring gender bias in translation of
gender-neutral pronouns. In: Proceedings of the First Workshop
on Gender Bias in Natural Language Processing. Florence, Italy:
Association for Computational Linguistics, 2019. p. 173–181. Disponível
em: <https://www.aclweb.org/anthology/W19-3824>.
GARCIA, M. Racist in the machine: The disturbing implications of
algorithmic bias. World Policy Journal, Duke Univ Press, v. 33, n. 4,
p. 111–117, 2016.
KUCZMARSKI, J.; JOHNSON, M. Gender-aware natural language
translation. 2018.
MILLS, K.-A. ’Racist’ soap dispenser refuses to help dark-
skinned man wash his hands - but Twitter blames ’technology’.
2017. Last visited 2017-12-17. Disponível em: <http://www.mirror.co.
uk/news/world-news/racist-soap-dispenser-refuses-help-11004385>.
29

THANK YOU
BIBLIOGRAPHY III
PAPENFUSS, M. Woman In China Says Colleague’s
Face Was Able To Unlock Her iPhone X. 2017. Last visited
2017-12-17. Disponível em: <http://www.huffpostbrasil.com/entry/
iphone-face-recognition-double_us_5a332cbce4b0ff955ad17d50>.
PRATES, M. O. R.; AVELAR, P. H.; LAMB, L. C. Assessing gender
bias in machine translation: a case study with google translate. Neural
Computing and Applications, Mar 2019. ISSN 1433-3058. Disponível
em: <https://doi.org/10.1007/s00521-019-04144-6>.
PRATES, M. O. R.; AVELAR, P. H. C.; LAMB, L. C. On quantifying
and understanding the role of ethics in AI research: A historical account
of flagship conferences and journals. In: GCAI. [S.l.]: EasyChair, 2018.
(EPiC Series in Computing, v. 55), p. 188–201.
RUDINGER, R. et al. Gender bias in coreference resolution. In:
NAACL-HLT (2). [S.l.]: Association for Computational Linguistics,
2018. p. 8–14.
30

THANK YOU
BIBLIOGRAPHY IV
SCHIEBINGER, L. Scientiﬁc research must take gender into account.
Nature, Nature Publishing Group, v. 507, n. 7490, p. 9, 2014.
STANOVSKY, G.; SMITH, N. A.; ZETTLEMOYER, L. Evaluating
gender bias in machine translation. In: ACL (1). [S.l.]: Association for
Computational Linguistics, 2019. p. 1679–1684.
WEBSTER, K. et al. Mind the gap: A balanced corpus of gendered
ambiguous pronouns. In: Transactions of the ACL. [S.l.: s.n.], 2018.
p. to appear.
ZHAO, J. et al. Gender bias in coreference resolution: Evaluation
and debiasing methods. In: NAACL-HLT (2). [S.l.]: Association for
Computational Linguistics, 2018. p. 15–20.
31

Breakout 1. Research and Development, including Technical Performance.

Recommended

Recommended

More Related Content

Similar to Breakout 1. Research and Development, including Technical Performance.

Similar to Breakout 1. Research and Development, including Technical Performance. (20)

More from Saurabh Mishra

More from Saurabh Mishra (6)

Recently uploaded

Recently uploaded (20)

Breakout 1. Research and Development, including Technical Performance.