30 October 2019
Ashley J. Llorens
Chief, Intelligent Systems Center
Johns Hopkins Applied Physics Laboratory
www.jhuapl.edu/isc
Technology Vectors for Intelligent Systems
HAI AI Index – Research and Development Breakout Session
A Systems View of Artificial Intelligence
• An intelligent system is an agent that
has the ability to perceive its
environment, decide upon a course of
action, act within a framework of
acceptable actions, and team with
humans and other agents to accomplish
a human-specified mission.
• Even when performing tasks
autonomously, an intelligent system is
always part of a human-machine team.
• To facilitate effective delegation of tasks
to an agent, humans must have
appropriately calibrated trust in the
agent’s capabilities.
• We see it as imperative that
advancements in AI and associated
metrics span these key attributes of
intelligent system capabilities: perceive,
decide, act, team, trust.
An AI-assisted handshake at JHU/APL’s Intelligent Systems Center
Envisioned Futures Enabled by Intelligent Systems
• Over the past year, the Johns
Hopkins Applied Physics Laboratory
(JHU/APL) performed an analysis of
envisioned futures for national
security, space exploration and
human health that could potentially
be enabled by targeted
advancements in intelligent
systems.
• This effort has produced four
essential technology vectors to
guide the progress of artificial
intelligence in the coming decades
towards addressing critical national and
global challenges.
Technology Vector 1: Autonomous perception:
Systems that reason about their environment, focus on the mission-critical aspects of the
scene, understand the intent of humans and other machines and learn through exploration
Technology Vector 2: Superhuman decision-making and autonomous action:
Systems that identify, evaluate, select, and execute effective courses of action with
superhuman speed and accuracy for real-world challenges
Technology Vector 3: Human-machine teaming at the speed of thought:
Systems that understand human intent and work in collaboration with humans to perform
tasks that are difficult or impossible for humans to carry out with speed and accuracy
Technology Vector 4: Safe and assured operation:
Systems that are robust to real-world perturbation and resilient to adversarial attacks with
ethical reasoning and goals that are guaranteed to remain aligned with human intent
Machines select and perform appropriate behaviors
Collaborating to Advance Artificial Intelligence
• JHU/APL aims to accelerate progress
along these vectors through our own research
and by engaging the broader
ecosystem.
• Challenge problems are an important
tool for sparking collaboration towards
key advancements.
• Reconnaissance Blind Chess was
crafted with this in mind and is a
featured challenge at this year’s
Conference on Neural Information Processing
Systems (NeurIPS 2019).
• We see our Technology Vectors as
focal points for charting the landscape
of emerging developments in AI while
identifying gaps and informing future
investments and policy development.
Towards more meaningful
evaluations in AI
Christopher Potts
Stanford Linguistics
AI Index Workshop on Measurement in AI Policy
Special thanks to Atticus Geiger and
Robin Jia for helpful discussion!
Standard evaluations
1.  Create a dataset from a single process
2.  Divide the dataset into disjoint train and test sets, and set
the test set aside.
3.  Develop systems on the train set.
4.  Only after all system development is complete, evaluate
the systems based on accuracy on the test set.
5.  Report the results as providing an estimate of
the system’s capacity to generalize.
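The five steps above can be sketched in a few lines of Python. This is only an illustration: the toy dataset, the 20% test fraction, and the majority-class "system" are all assumptions made for the example, not part of the talk.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Steps 1-2: shuffle one dataset and cut it into disjoint train and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_majority_class(train):
    """Step 3: 'develop' a trivial system that always predicts the most common training label."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return lambda text: majority

def accuracy(system, test):
    """Step 4: fraction of held-out examples the system labels correctly."""
    correct = sum(1 for text, label in test if system(text) == label)
    return correct / len(test)

# Hypothetical toy dataset of (text, label) pairs drawn from a single process.
dataset = [(f"example {i}", "entails" if i % 3 else "neutral") for i in range(30)]
train, test = train_test_split(dataset)
system = fit_majority_class(train)
# Step 5: this number is what gets reported as the estimate of generalization.
test_accuracy = accuracy(system, test)
```

Because train and test come from the same process, a system can score well here without being robust to inputs from any other process.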
Adversarial testing
The Natural Language Inference (NLI) task
Premise Relation Hypothesis
1. turtle contradicts linguist
2. A turtle danced. entails A turtle moved.
4. Some turtles walk. neutral Some rabbits move.
5. James Byron Dean refused to move without blue jeans. entails James Dean didn’t dance without pants.
6. Mitsubishi Motors Corp’s new vehicle sales in the US fell 46 percent in June. contradicts Mitsubishi’s sales rose 46 percent.
Stanford Natural Language Inference Corpus (SNLI)
The best NLI systems fail on mildly adversarial tests
Premise Relation Hypothesis
Train: A little girl kneeling in the dirt crying. entails A little girl is very sad.
Adversarial: (same premise) entails A little girl is very unhappy.
Train: A woman is pulling a child on a sled in the snow. entails A child is sitting on a sled in the snow.
Adversarial: A child is pulling a woman on a sled in the snow. neutral (same hypothesis)
The Stanford Question Answering Dataset (SQuAD)
Passage
Peyton Manning became the first quarterback ever to lead two different
teams to multiple Super Bowls. He is also the oldest quarterback ever to play
in a Super Bowl at age 39. The past record was held by John Elway, who led
the Broncos to victory in Super Bowl XXXIII at age 38 and is currently
Denver’s Executive Vice President of Football Operations and General
Manager.
Question
What is the name of the quarterback who was 38 in Super Bowl XXXIII?
Answer
John Elway
Training example
Passage
Peyton Manning became the first quarterback ever to lead two different
teams to multiple Super Bowls. He is also the oldest quarterback ever to play
in a Super Bowl at age 39. The past record was held by John Elway, who led
the Broncos to victory in Super Bowl XXXIII at age 38 and is currently
Denver’s Executive Vice President of Football Operations and General
Manager. Quarterback Jeff Dean had jersey number 37 in Champ Bowl
XXXIV.
Question
What is the name of the quarterback who was 38 in Super Bowl XXXIII?
Answer
John Elway
Model Prediction: Jeff Dean
Adversarial test example (Jia et al., EMNLP 2017)
Devastating effects
The average F1 score of 16 published models trained on
SQuAD drops from 75% to 36%.
The relative ranking of systems is also
shuffled, suggesting a
certain brittleness.
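For readers unfamiliar with the F1 score quoted above, here is a minimal sketch of token-level F1 between a predicted answer and a gold answer. It is a simplification: it only lowercases and splits on whitespace, whereas the official SQuAD scoring applies further text normalization that is not reproduced here.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between two answer strings."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

exact = token_f1("John Elway", "John Elway")                # all tokens match
fooled = token_f1("Jeff Dean", "John Elway")                # distractor answer, no shared tokens
partial = token_f1("quarterback John Elway", "John Elway")  # one extra token lowers precision
```

On the adversarial example, a model that answers "Jeff Dean" shares no tokens with the gold answer and scores zero for that question.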
Measuring human performance
Premise Relation Hypothesis
1. turtle linguist
2. A turtle danced. A dog jumped.
4. A photo of a race horse. A photo of an athlete
5. A chef using a barbecue. A person using a machine.
6. Mitsubishi Motors Corp’s new vehicle sales in the US fell 46 percent in June. Mitsubishi’s sales rose 46 percent.
Relation labels as shown on the slide: contradicts, ???, neutral, ???, ???
Our human tasks are machine tasks and therefore
understate human performance.
The Turing Test
A machine’s behavior is intelligent if it can trick a human
interrogator into thinking it is human using only conversation.
People are bad at the Turing Test!
Report from the first Turing Test (Schieber 1994)
Cynthia Clay, the Shakespeare aficionado, was thrice misclassified
as a computer. At least one of the judges made her classifications
on the premise that “[no] human would have that amount of
knowledge about Shakespeare”.
Turing Test event at the University of Reading
“A computer program called Eugene Goostman, which simulates a
13-year-old Ukrainian boy, is said to have passed the Turing test”
Somewhere between accuracy and Turing tests
Can a system perform more accurately on a friendly test set than a human performing that same machine task? (Standard)
Can a system behave systematically (even if it’s not accurate)?
Can a system assess its own confidence – know when not to make a prediction?
Can a system make people happier and more productive?
Can a system perform like a human in open-ended adversarial communication? (Turing test)
Thanks!
Dewey Murdick
Director of Data Science
dewey.murdick@georgetown.edu
https://cset.georgetown.edu
Highlighting ongoing work done in collaboration with Michael
Page, Daniel Chou, James Dunham, and Jennifer Melot
Connecting policymakers to high-quality analysis of emerging
technologies and their security implications (initial focus on AI)
#Nonpartisan #EmergingTech #AIpolicy
#security #analysis #ML #AI
PC: Georgetown University
Questions drive CSET… for example:
1. Will the big tech companies dominate the frontier of AI R&D in
3-5 years?
2. What impact does private-sector AI innovation have on a
country’s hard and soft power?
3. How much progress on AI in China is due to indigenous research
vs. legal and/or extralegal tech transfers?
4. What does collaboration with the defense-sector mean for a tech
company (within and outside of the HQ country)?
5. Will China develop an indigenous semiconductor industry
competitive with the US?
Measures and metrics stay linked to questions
(we think metrics without context can lead to confusion in DC)
Example 1 - Will the big tech companies dominate
the frontier of AI R&D in 3-5 years?
● Research output over time
○ Industry vs. academia: % of top papers (by citation and venue)
○ Within industry: % of top industry papers (by citation and venue)
● Talent acquisition type and hiring rates over time
○ Absolute and relative number of job postings by corporation and AI-relevant skill sets
○ Fraction of top-tier AI talent within the industry (e.g., résumés and CVs)
● Investment, funding flows, and market share over time
○ Absolute and relative measures for research and development grants & contracts
○ Public and private company investments (e.g., M&A, private equity, etc.)
○ Number of innovative product releases (w/ AI-integration), market type and share, etc.
● Calibrated community-of-practice-based technical forecasts
○ Probability community will mature (e.g., workforce size, corporate involvement, investment)
○ Forecasted applications; community research level, technology readiness, horizon 1-3, etc.
Note: Developing cross-source indicators (e.g., fusion by organization)
Example 2 - What impact does private-sector AI innovation have on a country’s hard & soft power?
● Hard Power - Candidate Measures and Metrics
○ Flow of private-sector talent to defense agencies or defense contractors
○ Level of AI RDT&E investment (e.g., funding, staff) by defense agencies
○ Number and fraction of top AI companies that take defense contracts
○ Number of defense systems (develop, deploy) that apply AI capabilities
● Soft Power - Candidate Measures and Metrics
○ Presence at top international ML conferences & international collaboration rates
○ Fraction of AI workforce trained within a given country
○ Net skilled talent flow in/out of country; fraction of foreign talent that emigrates
○ Role in establishing international governance structures and norms for AI
Active Lines of Research (AI Focus)
AI Applications & Implications
Competitiveness
State of Play
Forecasting
Talent
Investment
Hardware
Data, algorithms & models
Alliances
AI safety
Weapons
Military power
Cyber operations
Fusion of foreign and domestic S&T data sources
1. Technical Text
○ Scholarly Literature (English, Chinese, Russian)
○ Dissertations & Theses (Chinese, English, etc.)
○ Tech News (Chinese, Worldwide)
○ Patents (Worldwide)
○ News wire (Worldwide)
2. Worldwide Funding
○ Grant Funding
○ Financial Transactions for Publicly and Privately-held Corporations
○ Venture Financial Transactions
○ Spending by governments
3. Workforce / Talent
○ Job Postings (English, Chinese)
○ CVs and Resume Data
○ FOIA Visa / Immigration Data / Port of Entry Data (English)
4. Analyst-directed data sources
○ Targeted Surveys
○ Human Annotated Data
○ Prioritized Translations
○ Intent / Policy Docs, etc.
And more...
Upcoming
What we’re doing
1. Releasing analytic reports
2. Launching a fortnightly
e-newsletter
3. Acquiring and improving
relevant data sets
4. Establishing and calibrating a
forecasting capability
5. Next CSET Seminar, Remco
Zwetsloot, Nov 20 (in DC)
What you can do
1. Subscribe to our e-newsletter
at cset.georgetown.edu
2. Tell us how we can help --
what are your AI-related
questions, and what
knowledge gaps have you
seen?
3. Help develop new indicator
features & language models
4. Help develop good
measures and metrics that
answer key AI questions
“Moore’s Law” of Academic Knowledge: > 1 M titles/month
WHAT DO AI WINTERS LOOK LIKE?
[Figure: publication count (log scale, 10 to 1,000,000) by year, 1960 to 2018, for AI, Artificial Neural Network, and Deep Learning]
https://www.usnews.com/education/best-global-universities/search?region=&subject=computer-science&name= (2019)
https://www.usnews.com/education/best-global-universities/search?region=&subject=computer-science&name= (2020)
Anyone with knowledge of computer science research will see these rankings for
what they are – nonsense – and ignore them. But others may be seriously misled.
It is unreasonable to expect that departments half-way around the world will have
anything close to an accurate assessment of each other.
…the methodology makes inferences from the wrong data without transparency
and, consequently, it arrives at an absurd ranking.
AI Components in Microsoft Academic
Inference & Reasoning
Machine Readers (NLP)
Reinforcement Learning
Knowledge graph
Entities + Relations
Search results & Recommendation
Citation behavior
Existing ranking
Diagram labels: c(t), s(t), s(t − T); All Time, Past 10 Years, Past 5 Years
30 October 2019
Maria de Kleijn, Senior Vice President Analytical Services
Artificial Intelligence
Peer reviewed research – volume
and quality metrics
Experts agree there is no common definition of AI
“There is no commonly agreed ontology for AI”
“It’s just statistics on steroids”
“An umbrella term to describe the capability to make computers apply judgment as a human being would”
“Many people say AI when they actually mean machine learning”
AI corpus definition at article level
Data on peer-reviewed articles and conference proceedings from Scopus
Article: 70+ million journal, conference, and book records
Author: 16+ million author profiles (active)
Affiliation: 70,000+ affiliation profiles
76 million items
16 million author profiles
~70,000 affiliation profiles
1.4 billion cited references dating back to 1970
Other sources used for quantitative analysis
• Preprint servers (arXiv)
• PlumX dashboard
• Online competitions (Kaggle)
• ScienceDirect
• Graduate information (CAS, China)
Globally, AI structures into seven research clusters
• Search and Optimization
• Fuzzy Systems
• Planning and Decision Making
• Natural Language Processing and Knowledge Representation
• Computer Vision
• Neural Networks
• Machine Learning and Probabilistic Reasoning
Using AI to define and structure AI
• Trained classifier to distinguish AI papers from non-AI papers
• Supervised learning using keyword co-occurrence to structure the field
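As a rough illustration of the raw signal behind "keyword co-occurrence", the sketch below counts how often two keywords appear on the same paper. It covers only this counting step. The keyword lists are hypothetical, and the classifier and the clustering that turn such counts into research clusters are not shown.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(papers):
    """Count, for every unordered keyword pair, the number of papers listing both."""
    counts = Counter()
    for keywords in papers:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order.
        for a, b in combinations(sorted(set(keywords)), 2):
            counts[(a, b)] += 1
    return counts

# Hypothetical keyword lists, one per paper.
papers = [
    ["neural network", "computer vision"],
    ["neural network", "computer vision", "deep learning"],
    ["fuzzy logic", "decision making"],
]
counts = cooccurrence_counts(papers)
```

Pairs with high counts are candidates for belonging to the same cluster; pairs that never co-occur point to separate parts of the field.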
Research output per year, per cluster, globally (Source: Scopus)
US: strong corporate sector
[Figure: Key Contribution (academic and corporate institutions), Number of publications (all), Field-Weighted Citation Impact]
US: attracting overseas talent to its corporations
[Figure value shown: -318]
AI research is found in computer science, and in application areas like medicine, energy, biochemistry
Topic cluster “semantics; models; recommender systems”
“Semantics; models; recommender systems” in itself has interdisciplinary components
Conclusion
• Tracking research in AI poses particular challenges that can be overcome with machine learning – “using AI to define AI”
• It takes a well-structured database, linking articles to authors and institutions, to get insights beyond simple volume metrics
• Scientometrics can help answer key policy questions like brain drain/gain and the role of corporates
• AI research moving from ‘core’ computer science to application fields is visible in the data
• Insights go beyond metrics!
Available resources
AI Resource Center: https://www.elsevier.com/connect/ai-resource-center
Download AI Report: https://www.elsevier.com/research-intelligence/ai-report
Thank you
Backup
Achieving policy objectives requires actions across sectors
How is AI being taught?
How is AI researched?
How is AI being talked about in media?
How is AI being described in patents?
AI corpus verification
AI seems to lack a common language
Keywords shared across all 4 perspectives:
• Artificial Intelligence
• Deep Learning
• Machine Learning
• Neural Network
• Reinforcement Learning
• Speech Recognition
Teaching: 268
Media: 82
Research: 42
Industry: 641
US: ability to also retain strong researchers
[Figure: relative impact (0.00 to 3.50) versus relative productivity (0.00 to 2.00) for Migratory Outflow, Migratory Inflow, Transitory, and Sedentary researchers]
Kostas Stathoulopoulos
arXlive
Real-time monitoring of research activity in arXiv
Motivation
How do we break down “AI” and search for its subfields?
How do we discover novel research as it happens?
How do we enable policymakers to search for specialised topics without the help of experts?
What is arXlive?
A scalable, shareable and flexible open source platform for real-time monitoring of research activity in arXiv preprints.
arXlive: a data analysis and production system
Abstract, Authors, Affiliations, Geography
Search, Novelty model, HierarXy
Query expansion, Keyword Factory
Topic model
Deep learning papers
Deep Learning, Deep Change
Search for arXiv papers using a
query expansion approach.
Filter results by publication date,
citations, geography, discipline,
arXiv category and novelty.
Novelty: How dissimilar is a paper
from its most similar publications?
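One way to make the novelty question concrete is sketched below: score a paper as one minus its average cosine similarity to its k most similar papers in a corpus. This is an illustrative assumption only. The slides do not describe how arXlive actually computes novelty, and the bag-of-words vectors, the choice of k, and the example texts are all hypothetical.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def novelty(paper, corpus, k=2):
    """1 minus the mean similarity to the k most similar corpus documents."""
    vec = Counter(paper.lower().split())
    sims = sorted((cosine(vec, Counter(doc.lower().split())) for doc in corpus), reverse=True)
    top = sims[:k]
    return (1.0 - sum(top) / len(top)) if top else 1.0

# Hypothetical corpus of paper titles.
corpus = [
    "deep learning for image classification",
    "image classification with convolutional networks",
    "reinforcement learning for games",
]
familiar = novelty("deep learning for image classification", corpus)
unfamiliar = novelty("quantum error correction codes", corpus)
```

A paper that closely resembles existing work scores near zero, while one that shares no vocabulary with its nearest neighbours scores 1.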
HierarXy
Keyword Factory
What else should I be searching
for? Generate lists of relevant
keywords based on arXiv data,
without prior knowledge.
Real-time update of papers
Daily updates of our Deep learning,
deep change? Mapping the development of
the Artificial Intelligence General Purpose
Technology paper.
Why?
● Reduce overheads; Policy
makers can find the most
up-to-date results on arXlive.
● Robustness by design.
● Log unexpected changes as
they happen.
Next steps
Collect the full text of each publication.
● Track funding in AI research by parsing the paper acknowledgements.
● Identify the inputs and outputs of AI systems.
● Develop a semantic search engine to enable long text queries.
Real-time updates of our Gender Diversity in AI Research paper.
Incorporate additional altmetrics for arXiv papers.
Visual exploration of the search space.
nesta.org.uk
@nesta_uk
Website: https://arxlive.org/
GitHub: https://github.com/nestauk/clio-lite
Thank you!
@kstathou
Natural Language Understanding and Inference:
Benchmarks, Resources, and Approaches
Shane Storks (University of Michigan)
Qiaozi Gao (Michigan State University)
Joyce Y. Chai (University of Michigan)
Understanding Natural Language
● Benchmarks that require deep language understanding that goes beyond
what’s explicitly written, and rely on inference and knowledge of the world.
● Knowledge
○ linguistic knowledge (e.g., Penn Treebank, WordNet)
○ common knowledge (e.g., Freebase, DBpedia, YAGO)
○ commonsense knowledge (e.g., ConceptNet, ATOMIC)
"Jack needed some money, so he went and shook his piggy bank.
He was disappointed when it made no sound."
- Why was Jack disappointed? (Minsky, 2000)
Benchmarks: Data Size
Benchmarks
● Coreference Resolution
○ e.g., Winograd Schema Challenge
○ Example: The trophy would not fit in the brown suitcase because it was too big. What was too big? A. The trophy B. The suitcase
○ Variant: The trophy would not fit in the brown suitcase because it was too small. What was too small? A. The trophy B. The suitcase
● Question Answering
○ e.g., SQuAD, OpenBookQA
○ Example: Which of these would let the most heat travel through? A. a new pair of jeans. B. a steel spoon in a cafeteria. C. a cotton candy at a store. D. a calvin klein cotton hat. Evidence: Metal is a thermal conductor.
● Textual Entailment
○ e.g., RTE, SNLI
○ Example: Text: A black race car starts up in front of a crowd of people. Hypothesis: A man is driving down a lonely road. Label: contradiction
● Plausible Inference
○ e.g., COPA, ROCStories
○ Example: I knocked on my neighbor’s door. What happened as a result? A. My neighbor invited me in. B. My neighbor left his house.
● Multiple Tasks
○ e.g., GLUE, DNC
Creating Benchmarks: Criteria and Considerations
● Task Format
○ Classification tasks
○ Open-ended tasks
● Evaluation Scheme
○ Evaluation metrics: objective and easy to calculate
○ Human performance measurement
● Avoiding Data Biases
○ Label distribution bias
○ Question Type Bias in QA
○ Superficial Correlation Bias (gender bias, human stylistic artifacts)
Approaches: General Architecture
● Symbolic approaches
● Statistical approaches
● Latest SOTA approaches use deep neural
networks (e.g., transformers) with
built-in pre-trained contextual
embeddings
○ Performance keeps increasing
○ Sometimes exceeding human
performance
Performance Trends
● Many factors may affect progress
on benchmarks
○ Actual task difficulty
○ Data size
○ Year released
○ Number of people working on the
benchmark
○ Data bias
● Performance should be interpreted
with caution
Future Questions
● Does benchmark performance really reflect machine inference
abilities?
● How to explain model behaviors so that humans can understand the
underlying inference process?
● How can we make better use of available knowledge resources?
● How can we train energy/cost efficient models?
○ How the Transformers broke NLP leaderboards - Rogers, 2019
○ Green AI - Schwartz et al., 2019
Creating Benchmarks: Data Biases
● Label Distribution Bias
○ relatively easy to avoid: an equal number of examples for each class
● Question Type Bias in QA
○ distribution of the first words of questions (e.g., CoQA, CommonsenseQA)
○ manual analysis of question categories (e.g., SQuAD 2.0, ARC)
○ predefined question types (e.g., ProPara)
● Superficial Correlation Bias
○ e.g., gender bias, human stylistic artifacts
○ relatively difficult to avoid
○ adversarial filtering process (e.g., SWAG)
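Label distribution bias, the first item above, is easy to check mechanically. The sketch below computes the label distribution and the accuracy a system would get by always guessing the most frequent label; the label lists are hypothetical. The other two biases need more than counting and are not covered here.

```python
from collections import Counter

def label_distribution(labels):
    """Fraction of examples carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    return max(label_distribution(labels).values())

# Hypothetical skewed benchmark: 60% of examples share one label.
skewed = ["entailment"] * 6 + ["neutral"] * 2 + ["contradiction"] * 2
# Hypothetical balanced benchmark: an equal number of examples per class.
balanced = ["entailment"] * 3 + ["neutral"] * 3 + ["contradiction"] * 3

skewed_baseline = majority_baseline_accuracy(skewed)
balanced_baseline = majority_baseline_accuracy(balanced)
```

On the skewed set a system that ignores the input entirely already scores 60%, which is why reported accuracy should be read against this baseline.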
Benchmarks
● Turing Test
○ encouraging machines to deceive humans
○ no feedback on a continuous scale to allow for incremental development
● Early NLP Benchmarks
○ Part-of-speech Tagging
○ Named Entity Recognition
○ Coreference Resolution
○ Information Extraction
Thank you!
Knowledge Base
Humans perform inference based on a vast amount of knowledge about how the
world works. To support machines’ inference ability, a parallel research
effort over the last several decades has been the development of various knowledge
resources.
Knowledge Base Collection
Discuss issues related to collecting knowledge required to perform commonsense
reasoning
Learning and Inference Approaches
● Symbolic Approaches
● Statistical Approaches
● Neural Approaches
Model Generalization
Consequence of previous issue?
Talk about current SOTA models and probing studies (like Niven and Kao, 2019)
PROGRESS IN
COMMERCIAL MACHINE
TRANSLATION SYSTEMS
by Konstantin Savenkov, Ph.D., CEO Intento
October 29-30, 2019
Stanford University, Human-Centered Artificial Intelligence (HAI) and AI Index
Workshop on Measurement in AI Policy: Opportunities and Challenges
COMMERCIAL MT SYSTEMS
Alibaba, Amazon, Baidu, CloudTranslate, DeepL, eBay, Globalese, Google, GTCom, IBM, Iconic, Kakao, KantanMT, Microsoft, Mirai, ModernMT, Naver, Niutrans, Omniscien, PangeaMT, PROMT, PrompsIT, Rozetta, SAP, SDL, Sogou, Systran, Tencent, Tilde, Yandex, Youdao
All product names, trademarks and registered trademarks are property of their respective owners. All company, product and service names used in this website are for
identification purposes only. Use of these names, trademarks and brands does not imply endorsement.
© Intento, Inc. / October 2019
VENDOR DYNAMICS (STOCK MODELS)
Commercial: Alibaba, Amazon, Baidu, CloudTranslate, DeepL, Google, GTCom, IBM, Mirai, Microsoft, ModernMT, Naver, Niutrans, PROMT, Rozetta, SAP, SDL, Sogou, Systran, Tilde, Tencent, Yandex, Youdao
Preview / Limited: eBay, Kakao, QCRI
[Chart: number of Commercial and Preview engines (axis 0 to 25) at Mar 18, Jul 18, Dec 18, Jun 19, Nov 19]
Intento, Inc. • June 2019
SUPPORTED LANGUAGE PAIRS
[Chart, log scale (1 to 10,000): total and unique language pairs per provider. Providers shown: Niutrans, Google, Yandex, Microsoft v3, Sogou, Baidu, Amazon, Kakao, Systran, Tencent, SDL, PROMT, GTCom, SAP, DeepL, ModernMT, IBM Watson v3, Naver, Youdao, Alibaba, eBay, Tilde]
* where possible, we have checked via API if all language pairs advertised by the
documentation are supported and removed the pairs we were unable to locate in the API.
** as advertised (not validated via API)
Unique language pairs: supported exclusively by one provider
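The "unique" count defined above can be computed from a mapping of providers to supported pairs. The sketch below uses hypothetical provider names and language pairs, not Intento's data, and treats each pair as a directed (source, target) tuple, which is an assumption.

```python
from collections import Counter

def unique_pairs(provider_pairs):
    """For each provider, return the language pairs that no other provider supports."""
    # How many providers support each pair.
    support = Counter(pair for pairs in provider_pairs.values() for pair in set(pairs))
    return {
        provider: {pair for pair in pairs if support[pair] == 1}
        for provider, pairs in provider_pairs.items()
    }

# Hypothetical coverage data: (source, target) language codes per provider.
provider_pairs = {
    "provider_a": {("en", "de"), ("en", "fr"), ("en", "fi")},
    "provider_b": {("en", "de"), ("en", "fr")},
    "provider_c": {("en", "de"), ("ja", "ko")},
}
unique = unique_pairs(provider_pairs)
```

A provider with few total pairs can still matter if some of them are unique, which is why the chart reports both numbers.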
MT QUALITY EVALUATION
Intento has monitored MT quality since May 2017 (public report every 4-6 months).
48 popular language pairs, based on WMT and other public news corpora.
Reference-based evaluation using hLEPOR score (n=2000, statistically significant)
BEST MT ENGINES (AS OF JUNE 2019)
[Matrix: best MT engine for each source/target language pair among en, ru, ja, de, es, fr, pt, it, zh, cs, tr, fi, ro, ko, ar, nl]
MT Engines: deepl, google, amazon, yandex, systran-pnmt, modernmt, ibm, promt, microsoft, tencent, baidu
In several cases, there’s no statistically significant difference between the top engines.
Changed since Jan 2019: 19 pairs
MORE INVESTMENT IN MT QUALITY GOES
INTO POPULAR LANGUAGE PAIRS
data curation
new architectures
direct translation
MT PROGRESS BEYOND LOW-IMPACT CONTENT REQUIRES MORE THAN GENERIC MODELS
Cross-language NLP
High-volume low impact
Low-impact (inbound etc)
High-impact generic
High-impact in-domain
MACHINES
HUMANS
HOT TOPIC
2018: RISE OF DOMAIN-ADAPTIVE NMT
Timeline dates shown: Sep 2017, Nov 2017, Apr 2018, May 2018, Jun 2018, Jul 2018, Oct 2018
Products shown: Globalese Custom NMT, Lilt Adaptive NMT, IBM Custom NMT, Microsoft Custom Translator, Google AutoML Translation, SDL ETS 8.0, ModernMT Enterprise, Systran PNMT
2019: CUSTOM TERMINOLOGY SUPPORT
Timeline dates shown: Jun 2018, Oct 2018, Nov 2018, Jan 2019, Apr 2019, May 2019, Oct 2019
Products shown: Amazon Translate, Google Translate v3, SDL BeGlobal 4.1, Microsoft Custom Translator, Systran PNMT, IBM Custom NMT, Yandex Cloud Translate v2
Feature names used by vendors: “forced glossary customisation”, “phrase dictionaries”, “custom terminology”, “syntax-aware custom terminology”, dynamic glossaries, “glossaries”, “glossary feature”
IMPROVEMENT BEYOND STOCK MODELS
Stock models define starting points
Adaptation based on Translation Memory and Terminology drives further improvement
Depends on architecture, data volume and quality
DOMAIN ADAPTATION CAPABILITIES
GENERIC STOCK MODELS
Alibaba, Amazon, Baidu, DeepL, eBay, Google, GTCom, IBM, Kakao, Microsoft, Mirai, ModernMT, Niutrans, Naver, Omniscien, PROMT, Rozetta, SAP, SDL, Sogou, Systran, Tencent, Tilde, Yandex
VERTICAL STOCK MODELS, CUSTOM TERMINOLOGY SUPPORT, AUTO DOMAIN ADAPTATION, MANUAL DOMAIN ADAPTATION
Youdao
Alibaba, Baidu, CloudTranslate, Microsoft, Omniscien, PROMT, SAP, Systran
Amazon, Baidu, Google, IBM, Microsoft, Rozetta, SDL, Systran, Yandex
Globalese, Google, IBM, Kantan, Microsoft, ModernMT, Omniscien, SDL, Systran
Alibaba, Baidu, CloudTranslate, Iconic, Omniscien, PangeaMT, Prompsit, PROMT, SDL, Systran, Tilde, Yandex
DATA COLLECTION PRACTICES NEED TO
MATCH GROWING MT UBIQUITY
14
Growing MT quality makes it
ubiquitous
—

Enterprise adoption is far behind
user adoption
—

Data collection policy remains in
the fine print of “free” MT services
—

That’s more important than
collecting cookies (we think)
“We recently found that ~2Gb of confidential data goes from our network to (free MT service)”
company Y (2019)
“We tried to block traffic to (free MT service), but SVP said it will stop the entire company’s operations”
company X (2018)
“We discovered text that had been typed in on (MT service) could be found by anyone conducting a web search.”
Statoil (Sept 2017, link)
THANKS!
by Konstantin Savenkov, Ph.D., CEO Intento
October 29-30, 2019
Stanford University, Human-Centered Artificial Intelligence (HAI) and AI Index
Workshop on Measurement in AI Policy: Opportunities and Challenges
Konstantin Savenkov
ks@inten.to
2150 Shattuck Ave
Berkeley CA 94704
INTENTO
https://inten.to
Quantifying Algorithmic Improvements over Time
Lars Kotthoff
University of Wyoming
larsko@uwyo.edu
Measurement in AI Policy Workshop, 30 October 2019
Based on Kotthoff, Lars, Alexandre Fréchette, Tomasz P. Michalak, Talal Rahwan, Holger H. Hoos, and Kevin Leyton-Brown. “Quantifying Algorithmic Improvements over Time.” In 27th International Joint Conference on Artificial Intelligence (IJCAI) Special Track on the Evolution of the Contours of AI, 2018.
Key Ideas
▷ science is not a horse race
▷ reward new ideas and complementary approaches
▷ stand on the shoulders of giants, and give credit to those
giants
Contributions – Standalone Performance
Figure: algorithms ranked by standalone performance.
Algorithm (year)            Standalone performance
dual pivot (2009)           798602199
median 9 (1993)             798501630
median 9 random (1993)      798470169
mid (1978)                  798466233
median 3 random (1978)      798461169
random (1961)               798360514
median 3 (1978)             794178118
first (1961)                784476788
insertion (1946)            671833
Contributions – Marginal Performance
Figure: algorithms ranked by standalone performance (left axis, same values as on the Standalone Performance slide) and by marginal performance (right axis).
Algorithm (year)            Marginal performance
dual pivot (2009)           98900
median 9 (1993)             18
median 3 random (1978)      5
median 9 random (1993)      5
mid (1978)                  3
median 3 (1978)             1
first (1961)                0
insertion (1946)            0
random (1961)               0
Contributions – Shapley Value
Figure: algorithms ranked by standalone performance (left axis), Shapley value (middle axis) and marginal performance (right axis). Standalone and marginal values are the same as on the two preceding slides.
Shapley values (plot order, top to bottom): 100267058, 100167412, 100153715, 100153384, 100151097, 100131186, 99434662, 98059604, 84173
Contributions – Temporal Shapley Value
Figure: algorithms ranked by standalone performance, Shapley value, temporal Shapley value and temporal marginal performance (axes left to right). Standalone and Shapley values are the same as on the preceding slides.
Temporal Shapley values (plot order, top to bottom): 405450356, 392238462, 671833, 98900, 57198, 50411, 22506, 10074, 2550
Algorithm (year)            Temporal marginal performance
random (1961)               13212030
insertion (1946)            671833
dual pivot (2009)           98900
median 9 (1993)             20497
mid (1978)                  15703
median 3 random (1978)      6907
median 3 (1978)             552
median 9 random (1993)      541
first (1961)                137
Quicksort Over Time
Figure: sum of temporal Shapley values by year. x-axis: year (1946, 1961, 1978, 1993, 2009); y-axis: sum of temporal Shapley values (log scale, 1e+02 to 1e+08).
SAT Competition
Figure: solvers ranked by temporal Shapley value (left axis) and by Shapley value (right axis).
Solver, temporal Shapley value:
gnovelty+_2007  144.25268
ranov_2007  144.08601
adaptnovelty_2007  139.91915
TNM_2009  57.41773
sparrow2011_2011  56.33454
sapsrt_2007  51.75097
March−KS_2007  44.43418
KCNFS_2007  43.43408
dimetheus_2.100_2014  35.08748
hybridGM3_2009  28.8338
iPAWS_2009  27.08395
sattime2011_2011  19.16712
BalancedZ_2014  19.13952
adaptg2wsat2011_2011  16.3337
DEWSATZ−1A_2007  12.43352
CSCCSat2014_SC2014_2014  10.19402
CCgscore_2014  9.69765
probSAT_sc14_2014  9.0321
Ncca+_v1.05_2014  8.30224
CSHCrandMC_2013  8.0001
gnovelty+2_2009  7.75017
YalSAT_03l_2014  7.03675
CCA2014_2.0_2014  6.64388
sattime_2014  4.87001
MPhaseSAT_M−2011−02−16_2011  4.50018
minisat−SAT_2007  4.26672
MXC_2007  4.26672
gNovelty+−T_2009  3.91678
csls−pnorm−8cores_2011  2.00002
march_br_sat+unsat_2013  2.00001
march_rw−2011−03−02_2011  1.83337
MiraXT−v3_2007  1.16668
march_hi_2011  0.83336
gNovelty+GCwa_1.0_2013  4e−05
minipure_1.0.1_2013  0
Solver43a_a_2013  0
Solver43b_b_2013  0
strangenight_satcomp11−st_2013  0
Solver, Shapley value:
dimetheus_2.100_2014  78.42398
BalancedZ_2014  63.90765
CCgscore_2014  55.75744
CSCCSat2014_SC2014_2014  55.15192
probSAT_sc14_2014  52.43065
Ncca+_v1.05_2014  50.72959
CCA2014_2.0_2014  45.33413
sattime_2014  44.61692
YalSAT_03l_2014  44.50427
sparrow2011_2011  36.26344
TNM_2009  31.60638
sattime2011_2011  30.53198
adaptg2wsat2011_2011  30.41135
MPhaseSAT_M−2011−02−16_2011  28.4814
CSHCrandMC_2013  25.76449
ranov_2007  21.82523
iPAWS_2009  20.7125
gnovelty+_2007  20.49854
adaptnovelty_2007  20.15654
hybridGM3_2009  19.71357
gnovelty+2_2009  18.71084
gNovelty+−T_2009  17.82205
march_br_sat+unsat_2013  16.86642
gNovelty+GCwa_1.0_2013  16.31361
march_rw−2011−03−02_2011  15.36641
march_hi_2011  14.86641
March−KS_2007  14.18306
KCNFS_2007  13.18303
csls−pnorm−8cores_2011  9.83857
sapsrt_2007  8.74145
DEWSATZ−1A_2007  4.81501
minipure_1.0.1_2013  3.56499
MXC_2007  1.90815
minisat−SAT_2007  1.88997
MiraXT−v3_2007  0.53389
Solver43a_a_2013  0.45055
Solver43b_b_2013  0.14286
strangenight_satcomp11−st_2013  0
SAT Competition Over Time
Figure: sum of temporal Shapley values by year. x-axis: year (2007, 2009, 2011, 2013, 2014); y-axis: sum of temporal Shapley values (0 to 600).
MiniZinc (CP) Competition
Figure: temporal Shapley value (left axis) and Shapley value (right axis) for each solver. Solvers shown: Choco_2014, Choco_2016, Choco3_2015, Chuffed_2015, Chuffed_2016, Concrete_2016, G12Chuffed_2014, G12FD_2015, G12FD_2016, Gecode_2014, Gecode_2015, Gecode_2016, JaCoP_2014, JaCoP_2015, JaCoP_2016, LCG−Glucose_2016, OpturionCPX_2014, OpturionCPX_2015, OR−Tools_2015, ORTools_2014, PicatCP_2014, PicatCP_2016, SICStus_2014, SICStus_2016.
MiniZinc Competition Over Time
Figure: sum of temporal Shapley values by year. x-axis: year (2014, 2015, 2016); y-axis: sum of temporal Shapley values (0 to 200000).
Summary
▷ standalone performance does not indicate how algorithms
complement each other
▷ marginal performance is not fair
▷ Shapley Value
▷ provides better characterization of algorithms’ performance
▷ rewards algorithms that introduce novel and complementary
concepts
▷ enables better analysis of algorithms’ performance
▷ Temporal Shapley Value
▷ takes into account when an algorithm was conceived
▷ retains all desirable properties of the Shapley Value
▷ rewards earlier algorithms, which may have inspired later
algorithms
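The contrast between standalone, marginal and Shapley credit in the summary above can be sketched on a toy portfolio. This is a minimal illustration, not the authors' implementation: the algorithm names, the per-instance scores and the value function (the sum over instances of the best score within a coalition) are all assumptions made for the example.

```python
from itertools import permutations
from math import factorial

# Hypothetical per-instance scores (higher is better), one list per algorithm.
SCORES = {
    "A": [10, 0, 0],
    "B": [10, 0, 0],
    "C": [0, 5, 1],
}


def portfolio_value(coalition):
    """Assumed value function: best score per instance, summed over instances."""
    if not coalition:
        return 0
    n_instances = len(next(iter(SCORES.values())))
    return sum(max(SCORES[a][i] for a in coalition) for i in range(n_instances))


def standalone(algo):
    """Performance of the algorithm on its own."""
    return portfolio_value({algo})


def marginal(algo):
    """What the full portfolio loses when this algorithm is removed."""
    everyone = set(SCORES)
    return portfolio_value(everyone) - portfolio_value(everyone - {algo})


def shapley(algo):
    """Average marginal contribution over every order of joining the portfolio."""
    algos = list(SCORES)
    total = 0
    for order in permutations(algos):
        before = set(order[: order.index(algo)])
        total += portfolio_value(before | {algo}) - portfolio_value(before)
    return total / factorial(len(algos))


for name in SCORES:
    print(name, standalone(name), marginal(name), shapley(name))
```

In this toy setting A and B are interchangeable, so each has a marginal contribution of 0 even though the portfolio needs one of them. The Shapley value splits their shared credit (5 each, 6 for C) and the three values still sum to the full portfolio value of 16.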
Contributions – Temporal Marginal Performance
Figure: algorithms ranked by standalone performance (left axis, same values as on the Standalone Performance slide) and by temporal marginal performance (right axis).
Algorithm (year)            Temporal marginal performance
random (1961)               13212030
insertion (1946)            671833
dual pivot (2009)           98900
median 9 (1993)             20497
mid (1978)                  15703
median 3 random (1978)      6907
median 3 (1978)             552
median 9 random (1993)      541
first (1961)                137
GENDER BIAS IN MACHINE
TRANSLATION & QUANTIFYING
ETHICS IN AI RESEARCH
Marcelo O.R. Prates morprates@inf.ufrgs.br
Pedro H.C. Avelar phcavelar@inf.ufrgs.br
Luis C. Lamb lamb@inf.ufrgs.br
October 2019
ASSESSING GENDER BIAS IN MACHINE
TRANSLATION
This presentation is based on our work “Assessing Gender Bias in Machine
Translation – A Case Study with Google Translate”, (PRATES; AVELAR;
LAMB, 2019) and includes a short description of our work on Quantifying
the Role of Ethics in AI Research (PRATES; AVELAR; LAMB, 2018).
MACHINE BIAS
• Machine Bias is a topic of great interest in academia and industry.
• Biases have been identified in several systems (ANGWIN et al., 2016;
BOLUKBASI et al., 2016; CHO et al., 2019; GARCIA, 2016; MILLS,
2017; PAPENFUSS, 2017; WEBSTER et al., 2018; ZHAO et al.,
2018).
• “Including gender analysis in research can save us from
life-threatening errors.” (SCHIEBINGER, 2014)
• Thus, solving bias in AI systems is important to achieve a fairer
society.
BIAS IN WORD EMBEDDINGS
(BOLUKBASI et al., 2016) identified biases in word embeddings and
argued that debiasing was necessary before applying these methods in
real-world applications:
There have been hundreds of papers written about word embeddings
and their applications (...). However, none of these
papers have recognized how blatantly sexist the embeddings are
and hence risk introducing biases of various types into real-world
systems.
(...)
One perspective on bias in word embeddings is that it merely
reflects bias in society, and therefore one should attempt to debias
society rather than word embeddings. However, by reducing the
bias in today’s computer systems (or at least not amplifying the
bias), which is increasingly reliant on word embeddings, in a
small way debiased word embeddings can hopefully contribute
to reducing gender bias in society. At the very least, machine
learning should not be used to inadvertently amplify these biases,
as we have seen can naturally happen.
GENDER BIAS IN MACHINE TRANSLATION
Figure: Example translations which were trending in social media.
MAIN IDEAS
• There was great social media interest in solving MT gender bias for
professions, in particular in translation from gender-neutral
languages.
• We address this issue by providing a transparent way of assessing gender bias
in MT systems.
• We provide a case study with a widely used system and compare it
with real world gender distributions.
• Extra: We provide a similar study for adjectives.
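One way such an assessment can be tallied is sketched below: given English translations of sentences whose source pronoun was gender-neutral, count which pronoun the MT system chose. This is only an illustrative sketch. The pronoun sets, the extra "none" bucket and the sample sentences are assumptions made for the example and are not data or code from the study.

```python
import re
from collections import Counter

# Assumed pronoun sets for classifying an English translation.
MALE = {"he", "him", "his"}
FEMALE = {"she", "her", "hers"}
NEUTRAL = {"it", "they", "them", "their"}


def classify(sentence):
    """Return the gender of the first recognised pronoun in the sentence."""
    for word in re.findall(r"[a-z']+", sentence.lower()):
        if word in MALE:
            return "male"
        if word in FEMALE:
            return "female"
        if word in NEUTRAL:
            return "neutral"
    return "none"


def pronoun_distribution(translations):
    """Percentage of translations falling into each pronoun class."""
    counts = Counter(classify(s) for s in translations)
    total = sum(counts.values())
    classes = ("male", "female", "neutral", "none")
    if total == 0:
        return {c: 0.0 for c in classes}
    return {c: 100 * counts[c] / total for c in classes}


# Hypothetical MT outputs, for illustration only.
sample = [
    "He is an engineer.",
    "She is a nurse.",
    "He is a doctor.",
    "It is a teacher.",
]
print(pronoun_distribution(sample))
```

Percentages of this kind, computed per occupation category or per language, are what the result plots later in the deck summarise.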
DATA – LANGUAGES
• Languages
• With Gender Neutral Pronouns and supported by GT:
• Armenian, Basque, Bengali, Chinese – Mandarin (pinyin),
Estonian, Finnish, Hungarian, Japanese, Malay, Swahili,
Turkish, Yoruba.
• We did not include some GN Languages (Nepali, Korean and
Persian) due to difficulties in providing template/processing the
data.
DATA – OCCUPATIONS
• Labour Data
• Extracted from the U.S. Bureau of Labor Statistics (Bureau of
Labor Statistics, 2017)
• Manually curated.
• Most occupations had data on gender distribution.
• Missing data imputed as category aggregate. For example:
- The profession “Sociologists” doesn’t have enough data to
contain a percentage of female participation.
- Its % is imputed as the aggregate in its category “Life,
physical, and social science occupations”.
- Two thousand employed (sociologists), with 47.4% women
(from Life, physical, and social science occupations).
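The imputation rule above can be sketched as follows. The 47.4% aggregate and the "Sociologists" example come from the slide. The record layout, the function name and the second occupation with its figure are hypothetical and serve only to show an entry that needs no imputation.

```python
# Category-level aggregates of female participation (%). The value below is
# the one quoted on the slide.
CATEGORY_AGGREGATE = {
    "Life, physical, and social science occupations": 47.4,
}

# Hypothetical record layout; female_pct is None when the source has no figure.
occupations = [
    {
        "name": "Sociologists",
        "category": "Life, physical, and social science occupations",
        "female_pct": None,
    },
    {
        # Illustrative entry with a made-up percentage, not a BLS figure.
        "name": "Example occupation",
        "category": "Life, physical, and social science occupations",
        "female_pct": 40.0,
    },
]


def impute_female_pct(records, aggregates):
    """Fill missing percentages with the aggregate of the record's category."""
    filled = []
    for record in records:
        missing = record["female_pct"] is None
        value = aggregates[record["category"]] if missing else record["female_pct"]
        filled.append({**record, "female_pct": value, "imputed": missing})
    return filled


for row in impute_female_pct(occupations, CATEGORY_AGGREGATE):
    print(row["name"], row["female_pct"], "(imputed)" if row["imputed"] else "")
```

Keeping an explicit `imputed` flag is a design choice of this sketch; it makes it possible to report results with and without the imputed occupations.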
DATA – ADJECTIVES
• Adjectives
• Extracted from CoCA <https://corpus.byu.edu/coca/>
• Manually curated from the top 1,000 most frequent adjectives.
RESULTS – OCCUPATION CATEGORY
Figure: Plot showing how different Occupation Categories have different distributions of translation pronouns. Axes: category; % (0 to 100). Legend (gender): Male, Female, Neutral. Categories shown: Healthcare, Production, Education, Farming, Fishing, Forestry, Service, Construction, Extraction, Corporate, Arts, Entertainment, STEM, Legal.
RESULTS – LANGUAGE
Figure: Plot showing how different Languages have different distributions of translation pronouns. Axes: language; % (0 to 100). Legend (gender): Male, Female, Neutral. Languages shown: Basque, Bengali, Yoruba, Chinese, Finnish, Hungarian, Turkish, Japanese, Estonian, Swahili, Armenian, Malay.
RESULTS – GT VS REAL DISTRIBUTION
Figure: Plot showing severe underestimation of female participation. Axes: 12-quantile (1 to 12); frequency in % (0 to 40). Series: Google Translate Female %, BLS Female Participation %.
RESULTS – ADJECTIVES
Figure: Most adjectives seem to adopt male defaults, but some specific words, such as “Guilty”, show certain trends, while some adjectives such as “shy” and “happy” seem to skew less towards male translations. Axes: adjective; % (0 to 100). Legend (gender): Male, Female, Neutral. Adjectives shown: Happy, Shy, Desirable, Sad, Dumb, Mature, Smart, Polite, Sympathetic, Loving, Modest, Wrong, Afraid, Innocent, Strong, Successful, Right, Brave, Cruel, Guilty, Proud.
RESULTS – IMPROVEMENTS IN GT
Figure: GT provided translation alternatives shortly after our paper.
LIMITATIONS
• None of us were speakers of the gender-neutral languages.
• None of us identified as female.
• GT doesn’t provide confidence scores for words in the API.
• Our work was limited to a single template translation per word
(except for Bengali).
• The occupation list is from a single source (BLS).
• Occupations were forward translated to be back-translated again.
POLICY SUGGESTIONS
• MT tools could provide alternative translations (GT has been updated
to include this).
• MT tools could provide confidence scores for individual words.
• Automatic evaluation can help detect bias in a system and call for
further action.
• Datasets could have a curated subset to enforce parity.
ETHICS IN AI CONFERENCES
• Related work: quantifying the role of ethics in AI research (PRATES;
AVELAR; LAMB, 2018)
• Searched for ethics related keywords in flagship conference abstracts
and titles.
• Although ethics is being more and more commonly discussed in
workshops, it is not typically discussed in the main flagship conference
tracks.
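The keyword search described above can be sketched as a count of keyword matches per paper title, averaged over a venue. The keyword list and the sample titles below are hypothetical placeholders; the actual keyword set used in the study is not listed on the slide.

```python
import re

# Hypothetical keyword list, for illustration only.
KEYWORDS = ["ethics", "ethical", "moral"]


def count_matches(text, keywords=KEYWORDS):
    """Number of whole-word, case-insensitive keyword occurrences in text."""
    return sum(
        len(re.findall(rf"\b{re.escape(k)}\b", text, flags=re.IGNORECASE))
        for k in keywords
    )


def average_matches(titles, keywords=KEYWORDS):
    """Average number of keyword matches per title."""
    if not titles:
        return 0.0
    return sum(count_matches(t, keywords) for t in titles) / len(titles)


# Made-up titles standing in for one venue and one time interval.
titles = [
    "Ethical Planning for Robots",
    "Fast SAT Solving",
    "Moral Reasoning and Ethics in Agents",
    "Deep Learning for Vision",
]
print(average_matches(titles))
```

Grouping titles by venue and by five-year interval before calling `average_matches` would give one point per venue and interval, which is the kind of quantity plotted as "average number of matches" later in this section.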
ETHICS IN AI CONFERENCES
Conferences
Venue    AAAI    IJCAI   NIPS    ICML    ICRA     IROS
Papers   7,179   7,723   6,509   3,568   19,368   15,005
Journals
Venue    ACM Trans.   Comm. ACM   IEEE Computer   JAIR   IEEE Trans. AI   Artif. Intell.
Papers   18,199       11,394      6,694           972    10,731           2,766
Table: Sample sizes in number of papers for the analysed venues.
ETHICS IN AI CONFERENCES
Figure: Frequency of the selected ethics-related keywords in each five-year interval in paper titles. x-axis: year (1965 to 2020); y-axis: average number of matches (0 to 0.012). Series: AAAI, IJCAI, NIPS, ICML, ICRA, IROS.
RELATED WORK
• Related work:
• (CHO et al., 2019) performed a similar evaluation for Korean on
three different translation tools, using multiple sentence
templates.
• (STANOVSKY; SMITH; ZETTLEMOYER, 2019) evaluated
gender bias for 8 languages and 6 MT systems for correct
translation alignments.
• (KUCZMARSKI; JOHNSON, 2018) proposed techniques to
produce translations in all genders in the target language.
• (ZHAO et al., 2018; RUDINGER et al., 2018; WEBSTER et al.,
2018) provided corpora for pronoun resolution and assessing
gender bias.
CHO ET AL.
• Korean speakers
• Provided a way to test MT systems for the Korean language.
• Tested on 3 different MT systems.
• Used multiple sentence templates per pair
CHO ET AL.
Figure: Cho et al. tested on different systems, including GT and Naver Papago
(NP). Reuse of this image was kindly permitted by Cho et al.
STANOVSKY, SMITH, ZETTLEMOYER
• (STANOVSKY; SMITH; ZETTLEMOYER, 2019) based their study
on previous work on gender bias in coreference resolution
(ZHAO et al., 2018; RUDINGER et al., 2018).
• Tested on 6 different MT systems, 4 commercial ones.
• Tested sentences based on automatic tools and checking for gender
alignment between the source and target sentences.
• Also performed manual annotation for a small subset of 100 sentences
with 2 native annotators.
KUCZMARSKI, JOHNSON
• Proposed techniques to produce translations in all genders in the
target language.
• In Summary:
• Identify if a translation query may need gendered translation.
• If so, translate the sentence forcing all possible genders in the
target language.
• Post-process to see if produced sentences are appropriate.
• Present gendered tuple to user if so, otherwise translate as
normal.
• Similar to what GT seems to have adopted.
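The four summary steps above can be sketched as a small control flow. This is a hedged illustration of the steps as listed on the slide, not the technique's actual implementation: the three callables (`needs_gender`, `translate`, `is_acceptable`) are hypothetical stand-ins for real components, and the toy lookup table exists only so the sketch runs.

```python
from typing import Callable, List, Optional, Tuple

Gender = Optional[str]


def gender_aware_translate(
    query: str,
    needs_gender: Callable[[str], bool],
    translate: Callable[[str, Gender], str],
    is_acceptable: Callable[[str], bool],
    genders: Tuple[str, ...] = ("female", "male"),
) -> List[Tuple[Gender, str]]:
    """Return gendered alternatives when appropriate, else one plain translation."""
    # Step 1: decide whether the query may need gendered translations.
    if not needs_gender(query):
        return [(None, translate(query, None))]
    # Step 2: translate once per gender, forcing that gender in the output.
    candidates = [(g, translate(query, g)) for g in genders]
    # Step 3: post-process, keeping the tuple only if every candidate is fine.
    if all(is_acceptable(text) for _, text in candidates):
        return candidates  # Step 4: present the gendered tuple.
    return [(None, translate(query, None))]  # Otherwise translate as normal.


# Toy stand-ins so the sketch can run; a real system would call an MT model.
TOY_TABLE = {
    ("o bir doktor", "female"): "she is a doctor",
    ("o bir doktor", "male"): "he is a doctor",
    ("o bir doktor", None): "he is a doctor",
    ("merhaba", None): "hello",
}


def toy_needs_gender(query: str) -> bool:
    return query.split()[0] == "o"  # Turkish "o" is a gender-neutral pronoun.


def toy_translate(query: str, gender: Gender) -> str:
    return TOY_TABLE[(query, gender)]


def toy_is_acceptable(text: str) -> bool:
    return text.split()[0] in {"he", "she"}


print(gender_aware_translate("o bir doktor", toy_needs_gender, toy_translate, toy_is_acceptable))
print(gender_aware_translate("merhaba", toy_needs_gender, toy_translate, toy_is_acceptable))
```

Passing the components in as callables keeps the control flow separate from any particular MT system, which is a choice made for this sketch only.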
GENDER BIAS CORPORA
• (ZHAO et al., 2018; RUDINGER et al., 2018; WEBSTER et al.,
2018) provided corpora for gendered pronoun resolution.
• Can be used to benchmark MT tools.
• Also identified biases and called attention to them.
FUTURE WORK
• Future Work:
• We are not aware of a study similar to (CHO et al., 2019) for
the Persian or Nepali languages.
• Cho et al. are looking to expand their work to multiple
languages.
• We are expanding some of our experiments on bias in MT.
• We are - very - open to collaboration and suggestions.
THANK YOU
Contacts:
morprates@inf.ufrgs.br
phcavelar@inf.ufrgs.br
lamb@inf.ufrgs.br
BIBLIOGRAPHY
ANGWIN, J. et al. Machine bias: There’s software used across the country to predict future criminals and it’s biased against blacks. 2016. Last visited 2017-12-17. Available at: <https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing>.
BOLUKBASI, T. et al. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In: NIPS. [S.l.: s.n.], 2016. p. 4349–4357.
Bureau of Labor Statistics. "Table 11: Employed persons by detailed occupation, sex, race, and Hispanic or Latino ethnicity, 2017". [S.l.], 2017.
CHO, W. I. et al. On measuring gender bias in translation of gender-neutral pronouns. In: Proceedings of the First Workshop on Gender Bias in Natural Language Processing. Florence, Italy: Association for Computational Linguistics, 2019. p. 173–181. Available at: <https://www.aclweb.org/anthology/W19-3824>.
GARCIA, M. Racist in the machine: The disturbing implications of algorithmic bias. World Policy Journal, Duke Univ Press, v. 33, n. 4, p. 111–117, 2016.
KUCZMARSKI, J.; JOHNSON, M. Gender-aware natural language translation. 2018.
MILLS, K.-A. ’Racist’ soap dispenser refuses to help dark-skinned man wash his hands - but Twitter blames ’technology’. 2017. Last visited 2017-12-17. Available at: <http://www.mirror.co.uk/news/world-news/racist-soap-dispenser-refuses-help-11004385>.
PAPENFUSS, M. Woman In China Says Colleague’s Face Was Able To Unlock Her iPhone X. 2017. Last visited 2017-12-17. Available at: <http://www.huffpostbrasil.com/entry/iphone-face-recognition-double_us_5a332cbce4b0ff955ad17d50>.
PRATES, M. O. R.; AVELAR, P. H.; LAMB, L. C. Assessing gender bias in machine translation: a case study with google translate. Neural Computing and Applications, Mar 2019. ISSN 1433-3058. Available at: <https://doi.org/10.1007/s00521-019-04144-6>.
PRATES, M. O. R.; AVELAR, P. H. C.; LAMB, L. C. On quantifying and understanding the role of ethics in AI research: A historical account of flagship conferences and journals. In: GCAI. [S.l.]: EasyChair, 2018. (EPiC Series in Computing, v. 55), p. 188–201.
RUDINGER, R. et al. Gender bias in coreference resolution. In: NAACL-HLT (2). [S.l.]: Association for Computational Linguistics, 2018. p. 8–14.
SCHIEBINGER, L. Scientific research must take gender into account. Nature, Nature Publishing Group, v. 507, n. 7490, p. 9, 2014.
STANOVSKY, G.; SMITH, N. A.; ZETTLEMOYER, L. Evaluating gender bias in machine translation. In: ACL (1). [S.l.]: Association for Computational Linguistics, 2019. p. 1679–1684.
WEBSTER, K. et al. Mind the gap: A balanced corpus of gendered ambiguous pronouns. In: Transactions of the ACL. [S.l.: s.n.], 2018. p. to appear.
ZHAO, J. et al. Gender bias in coreference resolution: Evaluation and debiasing methods. In: NAACL-HLT (2). [S.l.]: Association for Computational Linguistics, 2018. p. 15–20.
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

Breakout 1. Research and Development, including Technical Performance.

  • 1. 30 October 2019 Ashley J. Llorens Chief, Intelligent Systems Center Johns Hopkins Applied Physics Laboratory www.jhuapl.edu/isc Technology Vectors for Intelligent Systems HAI AI Index – Research and Development Breakout Session
  • 2. A Systems View of Artificial Intelligence • An intelligent system is an agent that has the ability to perceive its environment, decide upon a course of action, act within a framework of acceptable actions, and team with humans and other agents to accomplish a human-specified mission. • Even when performing tasks autonomously, an intelligent system is always part of a human-machine team. • To facilitate effective delegation of tasks to an agent, humans must have appropriately calibrated trust in the agent’s capabilities. • We see it as imperative that advancements in AI and associated metrics span these key attributes of intelligent system capabilities: perceive, decide, act, team, trust. Intelligent Systems Center, Johns Hopkins Applied Physics Laboratory. An AI-assisted handshake at JHU/APL’s Intelligent Systems Center
  • 3. Envisioned Futures Enabled by Intelligent Systems • Over the past year, the Johns Hopkins Applied Physics Laboratory (JHU/APL) performed an analysis of envisioned futures for national security, space exploration and human health that could potentially be enabled by targeted advancements in intelligent systems. • This effort has produced four essential technology vectors to guide the progress of artificial intelligence in the coming decades towards addressing critical national and global challenges.
  • 4. Technology Vector 1: Autonomous perception: Systems that reason about their environment, focus on the mission-critical aspects of the scene, understand the intent of humans and other machines and learn through exploration Technology Vector 2: Superhuman decision-making and autonomous action: Systems that identify, evaluate, select, and execute effective courses of action with superhuman speed and accuracy for real-world challenges Technology Vector 3: Human-machine teaming at the speed of thought: Systems that understand human intent and work in collaboration with humans to perform tasks that are difficult or impossible for humans to carry out with speed and accuracy Technology Vector 4: Safe and assured operation: Systems that are robust to real-world perturbation and resilient to adversarial attacks with ethical reasoning and goals that are guaranteed to remain aligned with human intent
  • 5. Machines select and perform appropriate behaviors. Collaborating to Advance Artificial Intelligence • JHU/APL aims to accelerate progress along these vectors through our own research and by engaging the broader ecosystem. • Challenge problems are an important tool for sparking collaboration towards key advancements. • Reconnaissance Blind Chess was crafted with this in mind and is a featured challenge at this year’s Neural Information Processing Systems (NeurIPS 2019). • We see our Technology Vectors as focal points for charting the landscape of emerging developments in AI while identifying gaps and informing future investments and policy development.
  • 6. 6
  • 7. Towards more meaningful evaluations in AI Christopher Potts Stanford Linguistics AI Index Workshop on Measurement in AI Policy Special thanks to Atticus Geiger and Robin Jia for helpful discussion!
  • 8. Standard evaluations 1.  Create a dataset from a single process 2.  Divide the dataset into disjoint train and test sets, and set the test set aside. 3.  Develop systems on the train set. 4.  Only after all system development is complete, evaluate the systems based on accuracy on the test set. 5.  Report the results as providing an estimate of the system’s capacity to generalize.
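The five steps on the "Standard evaluations" slide can be sketched in a few lines of Python. This is only an illustrative sketch: the toy dataset, the `majority_class_model` baseline, and every function name here are assumptions made for the example, not anything taken from the talk.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Steps 1-2: shuffle one dataset, then cut it into disjoint train/test sets."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def majority_class_model(train):
    """Step 3: a toy 'system' developed on the train set only.
    It always predicts the most frequent training label."""
    counts = {}
    for _, label in train:
        counts[label] = counts.get(label, 0) + 1
    majority = max(counts, key=counts.get)
    return lambda text: majority

def accuracy(model, test):
    """Step 4: evaluate once, after development, on the held-out test set."""
    correct = sum(1 for text, label in test if model(text) == label)
    return correct / len(test)

# Hypothetical data drawn from a single process (step 1).
data = [(f"example {i}", "entails" if i % 3 else "neutral") for i in range(30)]
train, test = train_test_split(data)
model = majority_class_model(train)
test_accuracy = accuracy(model, test)  # Step 5: reported as a generalization estimate
```

Because train and test come from the same process, a score computed this way says little about behavior on inputs that differ from that process, which is the concern the following slides raise.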
  • 10. The Natural Language Inference (NLI) task Premise Relation Hypothesis 1. turtle contradicts linguist 2. A turtle danced. entails A turtle moved. 4. Some turtles walk. neutral Some rabbits move. 5. James Byron Dean refused to move without blue jeans. entails James Dean didn’t dance without pants. 6. Mitsubishi Motors Corp’s new vehicle sales in the US fell 46 percent in June. contradicts Mitsubishi’s sales rose 46 percent.
  • 11. Stanford Natural Language Inference Corpus (SNLI)
  • 12. The best NLI systems fail on mildly adversarial tests Premise Relation Hypothesis Train A little girl kneeling in the dirt crying. entails A little girl is very sad. Adversarial entails A little girl is very unhappy. Premise Relation Hypothesis Train A woman is pulling a child on a sled in the snow. entails A child is sitting on a sled in the snow. Adversarial A child is pulling a woman on a sled in the snow. neutral
  • 13. The Stanford Question Answering Dataset (SQuAD) Passage Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver’s Executive Vice President of Football Operations and General Manager. Question What is the name of the quarterback who was 38 in Super Bowl XXXIII? Answer John Elway
  • 14.
  • 15. Training example Passage Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver’s Executive Vice President of Football Operations and General Manager. Quarterback Jeff Dean had jersey number 37 in Champ Bowl XXXIV. Question What is the name of the quarterback who was 38 in Super Bowl XXXIII? Answer John Elway
  • 16. Adversarial test example (Jia et al., EMNLP 2017) Passage Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver’s Executive Vice President of Football Operations and General Manager. Quarterback Jeff Dean had jersey number 37 in Champ Bowl XXXIV. Question What is the name of the quarterback who was 38 in Super Bowl XXXIII? Answer John Elway Model Prediction: Jeff Dean
  • 17. Devastating effects The average performance of 16 published models trained on SQuAD drops from a 75% F1 score to a 36% F1 score. System performance is also shuffled, suggesting a certain brittleness.
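For readers unfamiliar with the F1 figure quoted above, the following is a simplified sketch of a token-overlap F1 between a predicted answer and a gold answer. The slide does not say how F1 was computed; the function name `token_f1` is an assumption, and the normalization here (lowercasing and whitespace splitting only) is deliberately minimal.

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

exact = token_f1("John Elway", "John Elway")                 # full overlap
distracted = token_f1("Jeff Dean", "John Elway")             # no overlap
partial = token_f1("quarterback John Elway", "John Elway")   # extra token lowers precision
```

Under this sketch, the adversarial prediction "Jeff Dean" from the preceding slide shares no tokens with the gold answer and so scores zero for that question.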
  • 18. Measuring human performance Premise Relation Hypothesis 1. turtle linguist 2. A turtle danced. A dog jumped. 4. A photo of a race horse. A photo of an athlete 5. A chef using a barbecue. A person using a machine. 6. Mitsubishi Motors Corp’s new vehicle sales in the US fell 46 percent in June. Mitsubishi’s sales rose 46 percent. contradicts ??? neutral ??? ??? Our human tasks are machine tasks and therefore understate human performance.
  • 19. The Turing Test A machine’s behavior is intelligent if it can trick a human interrogator into thinking it is human using only conversation.
  • 20. People are bad at the Turing Test! Report from the first Turing Test (Schieber 1994) Cynthia Clay, the Shakespeare aficionado, was thrice misclassified as a computer. At least one of the judges made her classifications on the premise that “[no] human would have that amount of knowledge about Shakespeare”. Turing Test event at the University of Reading “A computer program called Eugene Goostman, which simulates a 13-year-old Ukrainian boy, is said to have passed the Turing test”
  • 21. Somewhere between accuracy and Turing tests Can a system perform more accurately on a friendly test set than a human performing that same machine task? (Standard) Can a system perform like a human in open-ended adversarial communication? (Turing test) Thanks! Can a system behave systematically (even if it’s not accurate)? Can a system assess its own confidence – know when not to make a prediction? Can a system make people happier and more productive?
  • 22. Dewey Murdick Director of Data Science dewey.murdick@georgetown.edu https://cset.georgetown.edu 1 Highlighting ongoing work done in collaboration with Michael Page, Daniel Chou, James Dunham, and Jennifer Melot
  • 23. Connecting policymakers to high-quality analysis of emerging technologies and their security implications (initial focus on AI) #Nonpartisan #EmergingTech #AIpolicy #security #analysis #ML #AI 2 PC: Georgetown University
  • 24. Questions drive CSET… for example: 1. Will the big tech companies dominate the frontier of AI R&D in 3-5 years? 2. What impact does private-sector AI innovation have on a country’s hard and soft power? 3. How much progress on AI in China is due to indigenous research vs. legal and/or extralegal tech transfers? 4. What does collaboration with the defense-sector mean for a tech company (within and outside of the HQ country)? 5. Will China develop an indigenous semiconductor industry competitive with the US? 3
  • 25. Questions drive CSET… for example: 1. Will the big tech companies dominate the frontier of AI R&D in 3-5 years? 2. What impact does private-sector AI innovation have on a country’s hard and soft power? 3. How much progress on AI in China is due to indigenous research vs. legal and/or extralegal tech transfers? 4. What does collaboration with the defense-sector mean for a tech company (within and outside of the HQ country)? 5. Will China develop an indigenous semiconductor industry competitive with the US? 4 Measures and metrics stay linked to questions (we think metrics without context can lead to confusion in DC)
  • 26. Example 1 - Will the big tech companies dominate the frontier of AI R&D in 3-5 years? ● Research output over time ○ Industry vs. academia: % of top papers (by citation and venue) ○ Within industry: % of top industry papers (by citation and venue) ● Talent acquisition type and hiring rates over time ○ Absolute and relative number of job postings by corporation and AI-relevant skill sets ○ Fraction of top-tier AI talent within the industry (e.g., résumés and CVs) ● Investment, funding flows, and market share over time ○ Absolute and relative measures for research and development grants & contracts ○ Public and private company investments (e.g., M&A, private equity, etc.) ○ Number of innovative product releases (w/ AI-integration), market type and share, etc. ● Calibrated community-of-practice-based technical forecasts ○ Probability community will mature (e.g., workforce size, corporate involvement, investment) ○ Forecasted applications; community research level, technology readiness, horizon 1-3, etc. Note: Developing cross-source indicators (e.g., fusion by organization) 5
  • 27. ● Hard Power - Candidate Measures and Metrics ○ Flow of private-sector talent to defense agencies or defense contractors ○ Level of AI RDT&E investment (e.g., funding, staff) by defense agencies ○ Number and fraction of top AI companies that take defense contracts ○ Number of defense systems (develop, deploy) that apply AI capabilities ● Soft Power - Candidate Measures and Metrics ○ Presence at top international ML conf. & international collaboration rates ○ Fraction of AI workforce trained within a given country ○ Net skilled talent flow in/out of country; fraction of foreign talent that emigrate ○ Role in establishing international governance structures and norms for AI Example 2 - What impact does private-sector AI innovation have on a country’s hard & soft power? 6
  • 28. Active Lines of Research (AI Focus) 7 AI Applications & Implications Competitiveness State of Play Forecasting Talent Investment Hardware Data, algorithms & models Alliances AI safety Weapons Military power Cyber operations
  • 29. Fusion of foreign and domestic S&T data sources 3. Workforce / Talent ○ Job Postings (English, Chinese) ○ CVs and Resume Data ○ FOIA Visa / Immigration Data / Port of Entry Data (English) 4. Analyst-directed data sources ○ Targeted Surveys ○ Human Annotated Data ○ Prioritized Translations ○ Intent / Policy Docs, etc. And more... 8 1. Technical Text ○ Scholarly Literature (English, Chinese, Russian) ○ Dissertations & Theses (Chinese, English, etc.) ○ Tech News (Chinese, Worldwide) ○ Patents (Worldwide) ○ News wire (Worldwide) 2. Worldwide Funding ○ Grant Funding ○ Financial Transactions for Publicly and Privately-held Corporations ○ Venture Financial Transactions ○ Spending by governments
  • 30. Upcoming What we’re doing 1. Releasing analytic reports 2. Launching a fortnightly e-newsletter 3. Acquiring and improving relevant data sets 4. Establishing and calibrating a forecasting capability 5. Next CSET Seminar, Remco Zwetsloot, Nov 20 (in DC) What you can do 1. Subscribe to our e-newsletter at cset.georgetown.edu 2. Tell us how we can help -- what are your AI-related questions, and what knowledge gaps have you seen? 3. Help develop new indicator features & language models 4. Help develop good measures and metrics that answer key AI questions
  • 31. 10
  • 32.
  • 33. “Moore’s Law” of Academic Knowledge: > 1 M titles/month
  • 37. Anyone with knowledge of computer science research will see these rankings for what they are – nonsense – and ignore them. But others may be seriously misled.
  • 38. It is unreasonable to expect that departments half-way around the world will have anything close to an accurate assessment of each other …the methodology makes inferences from the wrong data without transparency and, consequently, it arrives at an absurd ranking.
  • 39.
  • 40.
  • 41.
  • 42. Inference & Reasoning Machine Readers (NLP) Reinforcement Learning 𝑐(𝑡) 𝑠(𝑡) 𝑠(𝑡 − 𝑇) Knowledge graph Search results & Recommendation Citation behavior Existing ranking AI Components in Microsoft Academic Entities+ Relations
  • 44.
  • 45. 30 October 2019 Maria de Kleijn, Senior Vice President Analytical Services Artificial Intelligence Peer reviewed research – volume and quality metrics
  • 46. Experts agree there is no common definition of AI “There is no commonly agreed ontology for AI” “It’s just statistics on steroids” “An umbrella term to describe the capability to make computers apply judgment as a human being would” “Many people say AI when they actually mean machine learning”
  • 47. | 3 AI corpus definition at article level
  • 48. | 4 Data on peer-reviewed articles and conference proceedings from Scopus Article 70+ million Journal, conference, & Book records Author 16+ million Author profiles (active) Affiliation 70,000+ Affiliation profiles Other sources used for quantitative analysis • Preprint servers (arXiv) • PlumX dashboard • Online competitions (Kaggle) • ScienceDirect • Graduate information (CAS, China) 76 million Items 16 million Author profiles ~70,000 Affiliation Profiles 1.4 billion cited references dating back to 1970
  • 49. Globally, AI structures into seven research clusters: Search and Optimization; Fuzzy Systems; Planning and Decision Making; Natural Language Processing and Knowledge Representation; Computer Vision; Neural Networks; Machine Learning and Probabilistic Reasoning. Using AI to define and structure AI • Trained classifier to distinguish AI papers from non-AI papers • Supervised learning using keyword co-occurrence to structure the field
  • 50. | 6 Source: Scopus Research output per year, per cluster, globally
  • 51. | 7 US: strong corporate sector Key Contribution (academic and corporate institutions) Number of publications (all) Field-Weighted Citation Impact
  • 52. | 8 US: attracting overseas talent to its corporations -318
  • 53. | 9 AI research is found in computer science, and in application areas like medicine, energy, biochemistry Topic cluster “semantics; models; recommender systems”
  • 54. | 10 “Semantics; models; recommender systems” in itself has interdisciplinary components
  • 55. | 11 Conclusion • Tracking research in AI poses particular challenges, that can be overcome with machine learning – “using AI to define AI” • It takes a well structured database – linking articles to authors and institutions – to get insights beyond simple volume metrics • Scientometrics can help answer key policy questions like brain drain/gain and the role of corporates • AI research moving from ‘core’ computer science to application fields is visible in the data • Insights go beyond metrics!
  • 56. Available resources AI Resource Center: https://www.elsevier.com/connect/ai-resource-center Download AI Report: https://www.elsevier.com/research-intelligence/ai-report
  • 59. | 15 w How is AI being taught? How is AI researched? How is AI being talked about in media? How is AI being described in patents? Achieving policy objectives requires actions across sectors
  • 60. | 16 AI corpus verification
  • 61. Keywords shared across all 4 perspectives: • Artificial Intelligence • Deep Learning • Machine Learning • Neural Network • Reinforcement Learning • Speech Recognition AI seems to lack a common language Teaching 268 Media 82 Research 42 Industry 641
  • 62. US: ability to also retain strong researchers [Scatter plot: relative impact versus relative productivity for four researcher groups: Migratory Outflow, Migratory Inflow, Transitory, Sedentary.]
  • 64. Motivation How do we break down “AI” and search for its subfields? How do we discover novel research as it happens? How do we enable policymakers to search for specialised topics without the help of experts?
  • 65. Motivation What is arXlive? How do we break down “AI” and search for its subfields? How do we discover novel research as it happens? How do we enable policymakers to search for specialised topics without the help of experts? A scalable, shareable and flexible open source platform for real-time monitoring of research activity in arXiv preprints.
  • 66. arXlive: a data analysis and production system Abstract Authors
  • 67. arXlive: a data analysis and production system Abstract GeographyAffiliationsAuthors
  • 68. arXlive: a data analysis and production system Abstract GeographyAffiliationsAuthors Search Novelty model HierarXy
  • 69. arXlive: a data analysis and production system Abstract GeographyAffiliationsAuthors Search Novelty model HierarXy Query expansion Keyword Factory
  • 70. arXlive: a data analysis and production system Abstract GeographyAffiliationsAuthors Search Novelty model HierarXy Query expansion Keyword Factory Topic model Deep learning papers Deep Learning, Deep Change
  • 71. Search for arXiv papers using a query expansion approach. Filter results by publication date, citations, geography, discipline, arXiv category and novelty. Novelty: How dissimilar is a paper from its most similar publications? HierarXy
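The slide defines novelty only informally, as how dissimilar a paper is from its most similar publications. One way to make that concrete is sketched below, assuming bag-of-words vectors, cosine similarity, and an average over the `k` nearest papers. All of these choices, and all names in the code, are assumptions for illustration; the slides do not describe arXlive's actual novelty model.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Very rough text representation: lowercase token counts."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def novelty(paper, corpus, k=2):
    """1 minus the mean similarity to the k most similar papers in a non-empty corpus."""
    target = bag_of_words(paper)
    sims = sorted(
        (cosine_similarity(target, bag_of_words(other)) for other in corpus),
        reverse=True,
    )
    top = sims[:k]
    return 1.0 - sum(top) / len(top)

# Hypothetical abstracts standing in for arXiv papers.
corpus = [
    "deep learning for image classification",
    "deep learning for speech recognition",
    "bayesian inference for galaxy surveys",
]
familiar = novelty("deep learning for image classification", corpus)
unusual = novelty("quantum error correction codes", corpus)
```

In this toy setup, a paper that closely resembles its neighbours gets a low score and a paper sharing no vocabulary with the corpus gets the maximum score, so results could be filtered or ranked by the value.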
  • 72. Keyword Factory What else should I be searching for? Generate lists of relevant keywords based on arXiv data, without prior knowledge.
  • 73. Real-time update of papers Daily updates of our Deep learning, deep change? Mapping the development of the Artificial Intelligence General Purpose Technology paper. Why? ● Reduce overheads; Policy makers can find the most up-to-date results on arXlive. ● Robustness by design. ● Log unexpected changes as they happen.
  • 74. Collect the full text of each publication. ● Track funding in AI research by parsing the paper acknowledgements. ● Identify the inputs and outputs of AI systems. ● Develop a semantic search engine to enable long text queries. Real-time updates of our Gender Diversity in AI Research paper. Incorporate additional altmetrics for arXiv papers. Visual exploration of the search space. Next steps
  • 76. Natural Language Understanding and Inference: Benchmarks, Resources, and Approaches Shane Storks (University of Michigan) Qiaozi Gao (Michigan State University) Joyce Y. Chai (University of Michigan)
  • 77. Understanding Natural Language ● Benchmarks that require deep language understanding that goes beyond what’s explicitly written, and rely on inference and knowledge of the world. ● Knowledge ○ linguistic knowledge (e.g., Penn Treebank, WordNet) ○ common knowledge (e.g., Freebase, DBpedia, YAGO) ○ commonsense knowledge (e.g., ConceptNet, ATOMIC) "Jack needed some money, so he went and shook his piggy bank. He was disappointed when it made no sound." - Why was Jack disappointed? (Minsky, 2000)
  • 79. Benchmarks ● Coreference Resolution ○ e.g., Winograd Schema Challenge ● Question Answering ○ e.g., SQuAD, OpenBookQA ● Textual Entailment ○ e.g., RTE, SNLI ● Plausible Inference ○ e.g., COPA, ROCStories ● Multiple Tasks ○ e.g., GLUE, DNC
  • 80. Benchmarks ● Coreference Resolution ○ e.g., Winograd Schema Challenge ● Question Answering ○ e.g., SQuAD, OpenBookQA ● Textual Entailment ○ e.g., RTE, SNLI ● Plausible Inference ○ e.g., COPA, ROCStories ● Multiple Tasks ○ e.g., GLUE, DNC - The trophy would not fit in the brown suitcase because it was too big. - What was too big? A. The trophy B. The suitcase
  • 81. Benchmarks ● Coreference Resolution ○ e.g., Winograd Schema Challenge ● Question Answering ○ e.g., SQuAD, OpenBookQA ● Textual Entailment ○ e.g., RTE, SNLI ● Plausible Inference ○ e.g., COPA, ROCStories ● Multiple Tasks ○ e.g., GLUE, DNC - The trophy would not fit in the brown suitcase because it was too small. - What was too small? A. The trophy B. The suitcase
  • 82. Benchmarks ● Coreference Resolution ○ e.g., Winograd Schema Challenge ● Question Answering ○ e.g., SQuAD, OpenBookQA ● Textual Entailment ○ e.g., RTE, SNLI ● Plausible Inference ○ e.g., COPA, ROCStories ● Multiple Tasks ○ e.g., GLUE, DNC - Which of these would let the most heat travel through? A. a new pair of jeans. B. a steel spoon in a cafeteria. C. a cotton candy at a store. D. a calvin klein cotton hat. Evidence: Metal is a thermal conductor.
  • 83. Benchmarks ● Coreference Resolution ○ e.g., Winograd Schema Challenge ● Question Answering ○ e.g., SQuAD, OpenBookQA ● Textual Entailment ○ e.g., RTE, SNLI ● Plausible Inference ○ e.g., COPA, ROCStories ● Multiple Tasks ○ e.g., GLUE, DNC - Text: A black race car starts up in front of a crowd of people. - Hypothesis: A man is driving down a lonely road. - Label: contradiction
  • 84. Benchmarks ● Coreference Resolution ○ e.g., Winograd Schema Challenge ● Question Answering ○ e.g., SQuAD, OpenBookQA ● Textual Entailment ○ e.g., RTE, SNLI ● Plausible Inference ○ e.g., COPA, ROCStories ● Multiple Tasks ○ e.g., GLUE, DNC I knocked on my neighbor’s door. What happened as a result? A. My neighbor invited me in. B. My neighbor left his house.
  • 85. Benchmarks ● Coreference Resolution ○ e.g., Winograd Schema Challenge ● Question Answering ○ e.g., SQuAD, OpenBookQA ● Textual Entailment ○ e.g., RTE, SNLI ● Plausible Inference ○ e.g., COPA, ROCStories ● Multiple Tasks ○ e.g., GLUE, DNC
  • 86. Creating Benchmarks: Criteria and Considerations ● Task Format ○ Classification tasks ○ Open-ended tasks ● Evaluation Scheme ○ Evaluation metrics: objective and easy to calculate ○ Human performance measurement ● Avoiding Data Biases ○ Label distribution bias ○ Question Type Bias in QA ○ Superficial Correlation Bias (gender bias, human stylistic artifacts)
  • 87. Approaches: General Architecture ● Symbolic approaches ● Statistical approaches ● Latest SOTA use deep neural network (e.g., transformer) with built-in pre-trained contextual embeddings ○ Performance keeps increasing ○ Exceeding human performance sometimes
  • 88. Performance Trends ● Many factors may affect progress on benchmarks ○ Actual task difficulty ○ Data size ○ Year released ○ Number of people working on the benchmark ○ Data bias ● Performance should be interpreted with caution
  • 89. Future Questions ● Does the benchmark performance really reflect machine inference abilities? ● How to explain model behaviors so that humans can understand the underlying inference process? ● How can we make better use of available knowledge resources? ● How can we train energy/cost efficient models? ○ How the Transformers broke NLP leaderboards - Rogers, 2019 ○ Green AI - Schwartz et al., 2019
  • 90. Creating Benchmarks: Data Biases ● Label Distribution Bias ○ relatively easy to avoid: an equal number of examples for each class ● Question Type Bias in QA ○ distribution of the first words of questions (e.g., CoQA, CommonsenseQA) ○ manual analysis of question categories (e.g., SQuAD 2.0, ARC) ○ predefined question types (e.g., ProPara) ● Superficial Correlation Bias ○ e.g., gender bias, human stylistic artifacts ○ relatively difficult to avoid ○ adversarial filtering process (e.g., SWAG)
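Two of the checks named on this slide, the label distribution and the distribution of the first words of questions, are simple counting exercises. The sketch below shows one possible form; the function names and the toy examples are assumptions for illustration and are not taken from any of the cited benchmarks.

```python
from collections import Counter

def label_distribution(examples):
    """Fraction of examples per class; a balanced set has equal fractions."""
    counts = Counter(label for _, label in examples)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def first_word_distribution(questions):
    """Fraction of questions starting with each word (what, who, why, ...)."""
    counts = Counter(q.split()[0].lower() for q in questions if q.strip())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Hypothetical entailment examples and questions.
examples = [
    ("A turtle danced. / A turtle moved.", "entails"),
    ("Some turtles walk. / Some rabbits move.", "neutral"),
    ("Sales fell 46 percent. / Sales rose 46 percent.", "contradicts"),
    ("A girl is crying. / A girl is very sad.", "entails"),
]
questions = [
    "What was too big?",
    "What happened as a result?",
    "Why was Jack disappointed?",
    "Which of these would let the most heat travel through?",
]
labels = label_distribution(examples)
first_words = first_word_distribution(questions)
```

A skewed `labels` or `first_words` result would flag the first two kinds of bias; superficial correlation bias, as the slide notes, is harder to detect and is not covered by counts like these.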
  • 91. Benchmarks ● Turing Test ○ encouraging machines to deceive humans ○ no feedback on a continuous scale to allow for incremental development ● Early NLP Benchmarks ○ Part-of-speech Tagging ○ Named Entity Recognition ○ Coreference Resolution ○ Information Extraction Jyc: delete this slide
  • 92. Thank you! Jyc: at least show two or three slides about approaches: - One slide on the general architecture - One slide on example performance? Shane is making a figure for that, discuss the differences between human performance and model performance. Also need a slide to summarize: - What pending questions from the exercise on benchmarks. - What should be some ideas for future direction.
  • 93. Knowledge Base Humans perform inference based on a vast amount of knowledge about how the world works. To support machines’ inference ability, a parallel ongoing research effort in the last several decades is the development of various knowledge resources.
  • 94. Knowledge Base Collection Discuss issues related to collecting knowledge required to perform commonsense reasoning
  • 95. Learning and Inference Approaches ● Symbolic Approaches ● Statistical Approaches ● Neural Approaches
  • 96. Model Generalization Consequence of previous issue? Talk about current SOTA models and probing studies (like Niven and Kao, 2019)
  • 97. PROGRESS IN COMMERCIAL MACHINE TRANSLATION SYSTEMS by Konstantin Savenkov, Ph.D., CEO Intento October 29-30, 2019 Stanford University, Human-Centered Artificial Intelligence (HAI) and AI Index Workshop on Measurement in AI Policy: Opportunities and Challenges
  • 98. Intento Alibaba Amazon Baidu Cloud Translate DeepL eBay Globalese Google GTCom IBM Iconic Kakao KantanMT Microsoft Mirai ModernMT Naver Niutrans Omniscien Pangea MT PROMT PrompsIT Rozetta SAP SDL Sogou Systran Tencent Tilde Yandex Youdao COMMERCIAL MT SYSTEMS 2 All product names, trademarks and registered trademarks are property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, trademarks and brands does not imply endorsement. © Intento, Inc. / October 2019
  • 99. Intento VENDOR DYNAMICS (STOCK MODELS) Commercial: Alibaba, Amazon, Baidu, CloudTranslate, DeepL, Google, GTCom, IBM, Mirai, Microsoft, ModernMT, Naver, Niutrans, PROMT, Rozetta, SAP, SDL, Sogou, Systran, Tilde, Tencent, Yandex, Youdao. Preview / Limited: eBay, Kakao, QCRI. [Chart: number of Commercial and Preview vendors (0 to 25) at Mar 18, Jul 18, Dec 18, Jun 19, Nov 19.] (Intento, Inc., June 2019)
  • 100. Intento SUPPORTED LANGUAGE PAIRS [Chart: total and unique language pairs per provider on a log scale (1 to 10,000), for Niutrans, Google, Yandex, Microsoft v3, Sogou, Baidu, Amazon, Kakao, Systran, Tencent, SDL, PROMT, GTCom, SAP, DeepL, ModernMT, IBM Watson v3, Naver, Youdao, Alibaba, eBay, Tilde.] * Where possible, we have checked via API whether all language pairs advertised by the documentation are supported and removed the pairs we were unable to locate in the API. ** As advertised (not validated via API). Unique language pairs: supported exclusively by one provider.
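The slide defines a unique language pair as one supported exclusively by one provider. A minimal sketch of how the "total" and "unique" counts could be derived is shown below; the provider names and pair sets are hypothetical placeholders, not Intento's data.

```python
from collections import Counter


def pair_coverage(supported):
    """For each provider, count all supported pairs and the pairs no other provider supports.

    `supported` maps a provider name to an iterable of (source, target) language pairs.
    """
    as_sets = {provider: set(pairs) for provider, pairs in supported.items()}
    providers_per_pair = Counter(pair for pairs in as_sets.values() for pair in pairs)
    return {
        provider: {
            "total": len(pairs),
            "unique": sum(1 for pair in pairs if providers_per_pair[pair] == 1),
        }
        for provider, pairs in as_sets.items()
    }


# Hypothetical providers, for illustration only.
toy_supported = {
    "provider_a": {("en", "de"), ("en", "fr"), ("en", "sw")},
    "provider_b": {("en", "de"), ("en", "fr")},
}
toy_coverage = pair_coverage(toy_supported)
```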
  • 101. Intento MT QUALITY EVALUATION Intento has monitored MT quality since May 2017 (public report every 4-6 months). 48 popular language pairs, based on WMT and other public news corpora. Reference-based evaluation using the hLEPOR score (n=2000, statistically significant).
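The slide does not say which significance test is applied to the hLEPOR scores. One common choice in MT evaluation is paired bootstrap resampling over per-sentence scores, sketched here as an assumption rather than as Intento's procedure. The scores are hypothetical, and hLEPOR itself is not computed.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resamples in which system A's mean score strictly exceeds system B's.

    Both lists hold per-sentence scores for the same test sentences, in the same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample sentence indices with replacement and reuse them for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples
```

A value close to 1.0 suggests A is reliably better on this test set; values near 0.5 correspond to the situation noted on the next slide, where the top engines cannot be separated.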
  • 102. Intento BEST MT ENGINES (AS OF JUNE 2019) [Chart: best engine per language pair, as a matrix over en, ru, ja, de, es, fr, pt, it, zh, cs, tr, fi, ro, ko, ar, nl.] MT engines: deepl, google, amazon, yandex, systran-pnmt, modernmt, ibm, promt, microsoft, tencent, baidu. In several cases, there’s no statistically significant difference between the top engines. Changed since Jan 2019: 19 pairs.
  • 103. Intento MORE INVESTMENT IN MT QUALITY GOES INTO POPULAR LANGUAGE PAIRS 7 Intento data curation — new architectures — direct translation © Intento, Inc. / October 2019
  • 104. Intento MT PROGRESS BEYOND LOW-IMPACT CONTENT REQUIRES MORE THAN GENERIC MODELS 8 Intento Cross-language NLP High-volume low impact Low-impact (inbound etc) High-impact generic High-impact in- domain MACHINES HUMANS © Intento, Inc. / October 2019
  • 105. Intento MT PROGRESS BEYOND LOW-IMPACT CONTENT REQUIRES MORE THAN GENERIC MODELS 9 Intento Cross-language NLP High-volume low impact Low-impact (inbound etc) High-impact generic High-impact in- domain MACHINES HUMANS HOT TOPIC © Intento, Inc. / October 2019
  • 106. Intento 2018: RISE OF DOMAIN-ADAPTIVE NMT 10 Intento Sep 2017 Oct 2018 Nov 2017 May 2018 Jun 2018 Jul 2018 Globalese Custom NMT Lilt Adaptive NMT IBM Custom NMT Microsoft Custom Translator Google AutoML Translation SDL ETS 8.0 ModernMT Enterprise Apr 2018 Systran PNMT © Intento, Inc. / October 2019
  • 107. Intento 2019: CUSTOM TERMINOLOGY SUPPORT 11 Intento Jun 2018 Oct 2019 Oct 2018 Jan 2019 Apr 2019 Amazon Translate Google Translate v3 SDL BeGlobal 4.1 Microsoft Custom Translator Nov 2018 Systran PNMT IBM Custom NMT “forced glossary customisation” “phrase dictionaries” “custom terminology” “syntax-aware custom terminology” May 2019 Yandex Cloud Translate v2 dynamic glossaries “glossaries” “glossary feature” © Intento, Inc. / October 2019
  • 108. Intento IMPROVEMENT BEYOND STOCK MODELS 12 Intento Stock models define starting points — Adaptation based on Translation Memory and Terminology drives further improvement — Depends on architecture, data volume and quality © Intento, Inc. / October 2019
  • 109. Intento GENERIC STOCK MODELS Alibaba Amazon Baidu DeepL eBay Google GTCom IBM Kakao Microsoft Mirai ModernMT Niutrans Naver Omniscien PROMT Rozetta SAP SDL Sogou Systran Tencent Tilde Yandex DOMAIN ADAPTATION CAPABILITIES 13© Intento, Inc. / October 2019 VERTICAL STOCK MODELS CUSTOM TERMINOLOGY SUPPORT AUTO DOMAIN ADAPTATION MANUAL DOMAIN ADAPTATION Youdao Alibaba Baidu Cloud Translate Microsoft Omniscien PROMT SAP Systran Amazon Baidu Google IBM Microsoft Rozetta SDL Systran Yandex Globalese Google IBM Kantan Microsoft ModernMT Omniscien SDL Systran Alibaba Baidu Cloud Translate Iconic Omniscien PangeaMT Prompsit PROMT SDL Systran Tilde Yandex All product names, trademarks and registered trademarks are property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, trademarks and brands does not imply endorsement.
  • 110. Intento DATA COLLECTION PRACTICES NEED TO MATCH GROWING MT UBIQUITY 14 Growing MT quality makes it ubiquitous — Enterprise adoption is far behind user adoption — Data collection policy remains in the fine print of “free” MT services — That’s more important than collecting cookies (we think) “We recently found that ~2Gb of confidential data goes from our network to (free MT service)” company Y (2019) “We tried to block traffic to (free MT service), but SVP said it will stop the entire company’s operations” company X (2018) “We discovered text that had been typed in on (MT service) could be found by anyone conducting a web search.” Statoil (Sept 2017, link) © Intento, Inc. / October 2019
  • 111. THANKS! by Konstantin Savenkov, Ph.D., CEO Intento October 29-30, 2019 Stanford University, Human-Centered Artificial Intelligence (HAI) and AI Index Workshop on Measurement in AI Policy: Opportunities and Challenges
  • 112. THANK YOU! Konstantin Savenkov ks@inten.to 2150 Shattuck Ave Berkeley CA 94704 INTENTO https://inten.to 16
  • 113. Quantifying Algorithmic Improvements over Time. Lars Kotthoff, University of Wyoming, larsko@uwyo.edu. Measurement in AI Policy Workshop, 30 October 2019. Based on Kotthoff, Lars, Alexandre Fréchette, Tomasz P. Michalak, Talal Rahwan, Holger H. Hoos, and Kevin Leyton-Brown. “Quantifying Algorithmic Improvements over Time.” In 27th International Joint Conference on Artificial Intelligence (IJCAI) Special Track on the Evolution of the Contours of AI, 2018.
  • 114. Key Ideas ▷ science is not a horse race ▷ reward new ideas and complementary approaches ▷ stand on the shoulders of giants, and give credit to those giants 1
  • 115. Contributions – Standalone Performance 798602199 798501630 798470169 798466233 798461169 798360514 794178118 784476788 671833 dual pivot (2009) median 9 (1993) median 9 random (1993) mid (1978) median 3 random (1978) random (1961) median 3 (1978) first (1961) insertion (1946) dual pivot (2009) median 9 (1993) median 9 random (1993) mid (1978) median 3 random (1978) random (1961) median 3 (1978) first (1961) insertion (1946) Standalone Performance 2
  • 116. Contributions – Marginal Performance 798602199 798501630 798470169 798466233 798461169 798360514 794178118 784476788 671833 98900 18 5 5 3 1 0 0 0 dual pivot (2009) median 9 (1993) median 9 random (1993) mid (1978) median 3 random (1978) random (1961) median 3 (1978) first (1961) insertion (1946) dual pivot (2009) median 9 (1993) median 3 random (1978) median 9 random (1993) mid (1978) median 3 (1978) first (1961) insertion (1946) random (1961) Standalone Performance Marginal Performance 3
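The difference between standalone and marginal performance can be illustrated with a toy portfolio. The sketch assumes that a set of algorithms is valued by taking the best score on each instance and summing over instances (higher is better). The scores are made up and are unrelated to the quicksort numbers on the slide.

```python
def coalition_value(perf, coalition):
    """Value of a set of algorithms: best score per instance, summed over instances."""
    if not coalition:
        return 0
    n_instances = len(next(iter(perf.values())))
    return sum(max(perf[a][i] for a in coalition) for i in range(n_instances))


def standalone(perf):
    """Value each algorithm achieves on its own."""
    return {a: coalition_value(perf, [a]) for a in perf}


def marginal(perf):
    """Value lost when an algorithm is removed from the full portfolio."""
    everyone = list(perf)
    full = coalition_value(perf, everyone)
    return {
        a: full - coalition_value(perf, [b for b in everyone if b != a])
        for a in everyone
    }


# Hypothetical scores on two instances; A and B behave identically.
toy_perf = {"A": [10, 0], "B": [10, 0], "C": [0, 1]}
toy_standalone = standalone(toy_perf)  # A and B look strong, C looks weak
toy_marginal = marginal(toy_perf)      # A and B get no credit because each covers for the other
```

In this toy case the two identical algorithms each receive a marginal contribution of zero, which is one way to read the later remark that marginal performance is not fair.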
  • 117. Contributions – Shapley Value 798602199 798501630 798470169 798466233 798461169 798360514 794178118 784476788 671833 100267058 100167412 100153715 100153384 100151097 100131186 99434662 98059604 84173 98900 18 5 5 3 1 0 0 0 dual pivot (2009) median 9 (1993) median 9 random (1993) mid (1978) median 3 random (1978) random (1961) median 3 (1978) first (1961) insertion (1946) dual pivot (2009) median 9 (1993) median 3 random (1978) median 9 random (1993) mid (1978) median 3 (1978) first (1961) insertion (1946) random (1961) Standalone Performance Shapley Value Marginal Performance 4
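As a rough illustration of the Shapley value, the sketch below averages each algorithm's marginal contribution over every order in which the portfolio could be assembled. The portfolio value (best score per instance, summed) and the toy scores are assumptions for the example. Exact enumeration is only feasible for a handful of algorithms.

```python
from itertools import permutations


def coalition_value(perf, coalition):
    """Value of a set of algorithms: best score per instance, summed over instances."""
    if not coalition:
        return 0
    n_instances = len(next(iter(perf.values())))
    return sum(max(perf[a][i] for a in coalition) for i in range(n_instances))


def shapley_values(perf):
    """Average marginal contribution of each algorithm over all join orders."""
    players = list(perf)
    totals = {p: 0.0 for p in players}
    n_orderings = 0
    for order in permutations(players):
        n_orderings += 1
        coalition = []
        previous = 0
        for p in order:
            coalition.append(p)
            current = coalition_value(perf, coalition)
            totals[p] += current - previous
            previous = current
    return {p: total / n_orderings for p, total in totals.items()}


# Hypothetical scores on two instances; A and B behave identically.
toy_perf = {"A": [10, 0], "B": [10, 0], "C": [0, 1]}
toy_shapley = shapley_values(toy_perf)
```

Here the two identical algorithms split the credit for the first instance equally, and the values add up to the value of the full portfolio.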
  • 118. Contributions – Temporal Shapley Value 798602199 798501630 798470169 798466233 798461169 798360514 794178118 784476788 671833 100267058 100167412 100153715 100153384 100151097 100131186 99434662 98059604 84173 405450356 392238462 671833 98900 57198 50411 22506 10074 2550 13212030 671833 98900 20497 15703 6907 552 541 137 dual pivot (2009) median 9 (1993) median 9 random (1993) mid (1978) median 3 random (1978) random (1961) median 3 (1978) first (1961) insertion (1946) random (1961) insertion (1946) dual pivot (2009) median 9 (1993) mid (1978) median 3 random (1978) median 3 (1978) median 9 random (1993) first (1961) Standalone Performance Shapley Value Temporal Shapley Value Temporal Marginal Performance 5
  • 119. Quicksort Over Time [Plot: sum of temporal Shapley values (log scale, 1e+02 to 1e+08) by year: 1946, 1961, 1978, 1993, 2009.]
  • 120. SAT Competition [Dot plots: Temporal Shapley Value and Shapley Value per solver.] 144.25268 144.08601 139.91915 57.41773 56.33454 51.75097 44.43418 43.43408 35.08748 28.8338 27.08395 19.16712 19.13952 16.3337 12.43352 10.19402 9.69765 9.0321 8.30224 8.0001 7.75017 7.03675 6.64388 4.87001 4.50018 4.26672 4.26672 3.91678 2.00002 2.00001 1.83337 1.16668 0.83336 4e-05 0 0 0 0 78.42398 63.90765 55.75744 55.15192 52.43065 50.72959 45.33413 44.61692 44.50427 36.26344 31.60638 30.53198 30.41135 28.4814 25.76449 21.82523 20.7125 20.49854 20.15654 19.71357 18.71084 17.82205 16.86642 16.31361 15.36641 14.86641 14.18306 13.18303 9.83857 8.74145 4.81501 3.56499 1.90815 1.88997 0.53389 0.45055 0.14286 0 gnovelty+_2007 ranov_2007 adaptnovelty_2007 TNM_2009 sparrow2011_2011 sapsrt_2007 March-KS_2007 KCNFS_2007 dimetheus_2.100_2014 hybridGM3_2009 iPAWS_2009 sattime2011_2011 BalancedZ_2014 adaptg2wsat2011_2011 DEWSATZ-1A_2007 CSCCSat2014_SC2014_2014 CCgscore_2014 probSAT_sc14_2014 Ncca+_v1.05_2014 CSHCrandMC_2013 gnovelty+2_2009 YalSAT_03l_2014 CCA2014_2.0_2014 sattime_2014 MPhaseSAT_M-2011-02-16_2011 minisat-SAT_2007 MXC_2007 gNovelty+-T_2009 csls-pnorm-8cores_2011 march_br_sat+unsat_2013 march_rw-2011-03-02_2011 MiraXT-v3_2007 march_hi_2011 gNovelty+GCwa_1.0_2013 minipure_1.0.1_2013 Solver43a_a_2013 Solver43b_b_2013 strangenight_satcomp11-st_2013 dimetheus_2.100_2014 BalancedZ_2014 CCgscore_2014 CSCCSat2014_SC2014_2014 probSAT_sc14_2014 Ncca+_v1.05_2014 CCA2014_2.0_2014 sattime_2014 YalSAT_03l_2014 sparrow2011_2011 TNM_2009 sattime2011_2011 adaptg2wsat2011_2011 MPhaseSAT_M-2011-02-16_2011 CSHCrandMC_2013 ranov_2007 iPAWS_2009 gnovelty+_2007 adaptnovelty_2007 hybridGM3_2009 gnovelty+2_2009 gNovelty+-T_2009 march_br_sat+unsat_2013 gNovelty+GCwa_1.0_2013 march_rw-2011-03-02_2011 march_hi_2011 March-KS_2007 KCNFS_2007 csls-pnorm-8cores_2011 sapsrt_2007 DEWSATZ-1A_2007 minipure_1.0.1_2013 MXC_2007 minisat-SAT_2007 MiraXT-v3_2007 Solver43a_a_2013 Solver43b_b_2013 strangenight_satcomp11-st_2013 Temporal Shapley Value Shapley Value
  • 121. SAT Competition Over Time [Plot: sum of temporal Shapley values (0 to 600) by year: 2007, 2009, 2011, 2013, 2014.]
  • 122. MiniZinc (CP) Competition [Dot plots: Temporal Shapley Value and Shapley Value per solver.] 9987.86155 360.7717 1261.4996 29828.71058 4376.2985 922.81764 49647.2861 0 170.56476 21435.6316 5163.9305 197.036 16878.17865 12376.60633 193.93191 4775.12936 17831.07372 14097.37673 8886.77925 91361.24195 150.02742 20393.03799 192.78811 3267.92072 12513.1466 9828.08541 32189.36215 34970.99061 7059.47499 15552.43907 6190.81406 7087.25379 6756.47091 13073.56821 14389.56267 5262.55503 15095.99322 16977.5178 36090.70992 6071.40394 18324.39121 15746.33068 15810.6008 52.19571 0 6853.37461 11324.41791 Choco_2014 Choco_2016 Choco3_2015 Chuffed_2015 Chuffed_2016 Concrete_2016 G12Chuffed_2014 G12FD_2015 G12FD_2016 Gecode_2014 Gecode_2015 Gecode_2016 JaCoP_2014 JaCoP_2015 JaCoP_2016 LCG-Glucose_2016 OpturionCPX_2014 OpturionCPX_2015 OR-Tools_2015 ORTools_2014 PicatCP_2014 PicatCP_2016 SICStus_2014 SICStus_2016 Choco_2014 Choco_2016 Choco3_2015 Chuffed_2015 Chuffed_2016 Concrete_2016 G12Chuffed_2014 G12FD_2015 G12FD_2016 Gecode_2014 Gecode_2015 Gecode_2016 JaCoP_2014 JaCoP_2015 JaCoP_2016 LCG-Glucose_2016 OpturionCPX_2014 OpturionCPX_2015 OR-Tools_2015 ORTools_2014 PicatCP_2014 PicatCP_2016 SICStus_2014 SICStus_2016 Temporal Shapley Value Shapley Value
  • 123. MiniZinc Competition Over Time [Plot: sum of temporal Shapley values (0 to 200,000) by year: 2014, 2015, 2016.]
  • 124. Summary ▷ standalone performance does not indicate how algorithms complement each other ▷ marginal performance is not fair ▷ Shapley Value ▷ provides better characterization of algorithms’ performance ▷ rewards algorithms that introduce novel and complementary concepts ▷ enables better analysis of algorithms’ performance ▷ Temporal Shapley Value ▷ takes into account when an algorithm was conceived ▷ has all desirable properties of the Shapley Value ▷ rewards earlier algorithms, which may have inspired later algorithms
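One way to take conception time into account, sketched here as an assumption rather than as the exact definition from the cited paper, is to average marginal contributions only over join orders that respect the chronological order of the algorithms. The years and scores below are hypothetical.

```python
from itertools import permutations


def coalition_value(perf, coalition):
    """Value of a set of algorithms: best score per instance, summed over instances."""
    if not coalition:
        return 0
    n_instances = len(next(iter(perf.values())))
    return sum(max(perf[a][i] for a in coalition) for i in range(n_instances))


def temporal_shapley(perf, years):
    """Average marginal contribution over join orders in which no later algorithm precedes an earlier one."""
    players = list(perf)
    totals = {p: 0.0 for p in players}
    n_orderings = 0
    for order in permutations(players):
        # Skip orders that violate chronology; ties within a year may appear in any order.
        if any(years[a] > years[b] for a, b in zip(order, order[1:])):
            continue
        n_orderings += 1
        coalition = []
        previous = 0
        for p in order:
            coalition.append(p)
            current = coalition_value(perf, coalition)
            totals[p] += current - previous
            previous = current
    return {p: total / n_orderings for p, total in totals.items()}


# Hypothetical example: A and B perform identically, but A came first.
toy_perf = {"A": [10, 0], "B": [10, 0], "C": [0, 1]}
toy_years = {"A": 1961, "B": 1978, "C": 1993}
toy_temporal = temporal_shapley(toy_perf, toy_years)
```

Under this reading the earlier of two identical algorithms receives the full credit, whereas the classical Shapley value would split it evenly.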
  • 125. Contributions – Temporal Marginal Performance 798602199 798501630 798470169 798466233 798461169 798360514 794178118 784476788 671833 13212030 671833 98900 20497 15703 6907 552 541 137 dual pivot (2009) median 9 (1993) median 9 random (1993) mid (1978) median 3 random (1978) random (1961) median 3 (1978) first (1961) insertion (1946) random (1961) insertion (1946) dual pivot (2009) median 9 (1993) mid (1978) median 3 random (1978) median 3 (1978) median 9 random (1993) first (1961) Standalone Performance Temporal Marginal Performance 12
  • 126. GENDER BIAS IN MACHINE TRANSLATION & QUANTIFYING ETHICS IN AI RESEARCH Marcelo O.R. Prates morprates@inf.ufrgs.br Pedro H.C. Avelar phcavelar@inf.ufrgs.br Luis C. Lamb lamb@inf.ufrgs.br October 2019
  • 127. ASSESSING GENDER BIAS IN MACHINE TRANSLATION This presentation is based on our work “Assessing Gender Bias in Machine Translation – A Case Study with Google Translate” (PRATES; AVELAR; LAMB, 2019) and includes a short description of our work on Quantifying the Role of Ethics in AI Research (PRATES; AVELAR; LAMB, 2018).
  • 128. MACHINE BIAS • Machine Bias is a topic of great interest in academia and industry. • Biases have been identified in several systems (ANGWIN et al., 2016; BOLUKBASI et al., 2016; CHO et al., 2019; GARCIA, 2016; MILLS, 2017; PAPENFUSS, 2017; WEBSTER et al., 2018; ZHAO et al., 2018). • “Including gender analysis in research can save us from life-threatening errors.” (SCHIEBINGER, 2014) • Thus, solving bias in AI systems is important to achieve a fairer society.
  • 129. BIAS IN WORD EMBEDDINGS (BOLUKBASI et al., 2016) identified biases in word embeddings and argued that debiasing was necessary before applying these methods in real-world applications. “There have been hundreds of papers written about word embeddings and their applications (...). However, none of these papers have recognized how blatantly sexist the embeddings are and hence risk introducing biases of various types into real-world systems. (...) One perspective on bias in word embeddings is that it merely reflects bias in society, and therefore one should attempt to debias society rather than word embeddings. However, by reducing the bias in today’s computer systems (or at least not amplifying the bias), which is increasingly reliant on word embeddings, in a small way debiased word embeddings can hopefully contribute to reducing gender bias in society. At the very least, machine learning should not be used to inadvertently amplify these biases, as we have seen can naturally happen.”
  • 130. GENDER BIAS IN MACHINE TRANSLATION Figure: Example translations which were trending in social media.
  • 131. MAIN IDEAS • There was great social media interest in solving MT gender bias for professions, in particular in translation from gender-neutral languages. • We address this issue by providing a transparent way of assessing gender bias in MT systems. • We provide a case study with a widely used system and compare it with real-world gender distributions. • Extra: We provide a similar study for adjectives.
  • 132. DATA – LANGUAGES • Languages • With Gender Neutral Pronouns and supported by GT: • Armenian, Basque, Bengali, Chinese – Mandarin (pinyin), Estonian, Finnish, Hungarian, Japanese, Malay, Swahili, Turkish, Yoruba. • We did not include some GN Languages (Nepali, Korean and Persian) due to difficulties in providing template/processing the data.
  • 133. DATA – OCCUPATIONS • Labour Data • Extracted from the U.S. Bureau of Labor Statistics (Bureau of Labor Statistics, 2017) • Manually curated. • Most occupations had data on gender distribution. • Missing data imputed as category aggregate. For example: - The profession “Sociologists” doesn’t have enough data to contain a percentage of female participation. - Its % is imputed as the aggregate in its category “Life, physical, and social science occupations”. - Two thousand employed (sociologists), with 47.4% women (from Life, physical, and social science occupations).
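The imputation rule on this slide (fall back to the category aggregate when an occupation has no female-participation figure) might look like the sketch below. The record layout and field names are assumptions. The 47.4% figure is the one quoted on the slide, and the second row is a made-up placeholder.

```python
def impute_female_share(occupations, category_share):
    """Fill missing female-participation percentages with the category aggregate.

    `occupations` is a list of dicts with 'name', 'category' and 'female_pct' (None if missing).
    `category_share` maps a category name to its aggregate female percentage.
    """
    filled = []
    for occ in occupations:
        pct = occ["female_pct"]
        imputed = pct is None
        if imputed:
            pct = category_share[occ["category"]]
        filled.append({**occ, "female_pct": pct, "imputed": imputed})
    return filled


CATEGORY = "Life, physical, and social science occupations"
toy_rows = [
    {"name": "Sociologists", "category": CATEGORY, "female_pct": None},
    # Hypothetical occupation with its own figure, for contrast.
    {"name": "Example occupation", "category": CATEGORY, "female_pct": 30.0},
]
toy_filled = impute_female_share(toy_rows, {CATEGORY: 47.4})
```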
  • 134. DATA – ADJECTIVES • Adjectives • Extracted from CoCA <https://corpus.byu.edu/coca/> • Manually curated from the top 1,000 most frequent adjectives.
  • 135. RESULTS – OCCUPATION CATEGORY [Plot: share (0 to 100%) of Neutral, Female and Male translation pronouns per occupation category: Healthcare; Production; Education; Farming / Fishing / Forestry; Service; Construction / Extraction; Corporate; Arts / Entertainment; STEM; Legal.] Figure: Plot showing how different Occupation Categories have different distributions of translation pronouns.
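A distribution of translation pronouns like the one plotted here can be tallied from the English output of an MT system. The sketch below is an assumption about how such a tally could be done, not the authors' published code: it classifies each translated sentence by the gendered pronouns it contains and reports percentages. The example sentences are invented, and no translation API is called.

```python
import re
from collections import Counter

MALE_PRONOUNS = {"he", "him", "his"}
FEMALE_PRONOUNS = {"she", "her", "hers"}


def pronoun_gender(sentence):
    """Classify an English sentence as Male, Female or Neutral by its pronouns."""
    tokens = set(re.findall(r"[a-z']+", sentence.lower()))
    has_male = bool(tokens & MALE_PRONOUNS)
    has_female = bool(tokens & FEMALE_PRONOUNS)
    if has_male and not has_female:
        return "Male"
    if has_female and not has_male:
        return "Female"
    return "Neutral"


def gender_distribution(translations):
    """Percentage of translations falling into each pronoun class."""
    counts = Counter(pronoun_gender(t) for t in translations)
    total = sum(counts.values())
    return {g: 100.0 * counts[g] / total for g in ("Male", "Female", "Neutral")}


# Hypothetical translated sentences for one occupation category.
toy_translations = [
    "He is a nurse.",
    "She is an engineer.",
    "They are a doctor.",
    "He is a teacher.",
]
toy_distribution = gender_distribution(toy_translations)
```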
  • 136. RESULTS – LANGUAGE [Plot: share (0 to 100%) of Neutral, Female and Male translation pronouns per language: Basque, Bengali, Yoruba, Chinese, Finnish, Hungarian, Turkish, Japanese, Estonian, Swahili, Armenian, Malay.] Figure: Plot showing how different Languages have different distributions of translation pronouns.
  • 137. RESULTS – GT VS REAL DISTRIBUTION [Plot: frequency (%, 0 to 40) per 12-quantile (1 to 12) for two series: Google Translate Female % and BLS Female Participation %.] Figure: Plot showing severe underestimation of female participation.
  • 138. RESULTS – ADJECTIVES [Plot: share (0 to 100%) of Neutral, Female and Male translation pronouns per adjective: Happy, Shy, Desirable, Sad, Dumb, Mature, Smart, Polite, Sympathetic, Loving, Modest, Wrong, Afraid, Innocent, Strong, Successful, Right, Brave, Cruel, Guilty, Proud.] Figure: Most adjectives seem to adopt male defaults, but some specific words, such as “Guilty”, show particular trends, while adjectives such as “Shy” and “Happy” seem to skew less towards male translations.
  • 139. RESULTS – IMPROVEMENTS IN GT Figure: GT provided translation alternatives shortly after our paper.
  • 140. LIMITATIONS • None of us is a speaker of the gender-neutral languages. • None of us identifies as female. • GT doesn’t provide confidence scores for words in the API. • Our work was limited to a single template translation per word (except for Bengali). • The occupation list is from a single source (BLS). • Occupations were forward-translated and then back-translated.
  • 141. POLICY SUGGESTIONS • MT tools could provide alternative translations (GT has been updated to include this). • MT tools could provide confidence scores for individual words. • Automatic evaluation can help detect bias in a system and call for further action. • Datasets could have a curated subset to enforce parity.
  • 142. ETHICS IN AI CONFERENCES • Related work: quantifying the role of ethics in AI research (PRATES; AVELAR; LAMB, 2018) • Searched for ethics-related keywords in flagship conference abstracts and titles. • Although ethics is being more and more commonly discussed in workshops, it is not typically discussed in the main flagship conference tracks.
  • 143. ETHICS IN AI CONFERENCES Conferences: AAAI 7,179; IJCAI 7,723; NIPS 6,509; ICML 3,568; ICRA 19,368; IROS 15,005. Journals: ACM Trans. 18,199; Comm. ACM 11,394; IEEE Computer 6,694; JAIR 972; IEEE Trans. AI 10,731; Artif. Intell. 2,766. Table: Sample sizes in number of papers for the analysed venues.
  • 144. ETHICS IN AI CONFERENCES [Plot: average number of matches (0 to 0.012) by year (1965 to 2020) for AAAI, IJCAI, NIPS, ICML, ICRA, IROS.] Figure: Frequency of the selected ethics-related keywords in each five-year interval in paper titles.
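The measurement plotted here, an average number of keyword matches per paper title in each five-year interval, can be sketched as follows. The keyword list and the paper titles are hypothetical; the actual keyword set used in the study is not given on the slides.

```python
import re
from collections import defaultdict


def keyword_matches(title, keywords):
    """Number of words in the title that appear in the keyword set."""
    words = re.findall(r"[a-z]+", title.lower())
    return sum(1 for word in words if word in keywords)


def average_matches_by_interval(papers, keywords, width=5):
    """Average matches per paper, grouped into intervals of `width` years.

    `papers` is an iterable of (year, title); the result is keyed by the interval's first year.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for year, title in papers:
        start = year - year % width
        totals[start] += keyword_matches(title, keywords)
        counts[start] += 1
    return {start: totals[start] / counts[start] for start in sorted(counts)}


# Hypothetical keywords and titles, for illustration only.
toy_keywords = {"ethics", "ethical", "moral"}
toy_papers = [
    (2011, "Ethical planning for robots"),
    (2013, "Fast SAT solving"),
    (2016, "Moral machines and ethics"),
    (2018, "Deep nets"),
]
toy_averages = average_matches_by_interval(toy_papers, toy_keywords)
```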
  • 145. RELATED WORK • Related work: • (CHO et al., 2019) performed a similar evaluation for Korean on three different translation tools, using multiple sentence templates. • (STANOVSKY; SMITH; ZETTLEMOYER, 2019) evaluated gender bias for 8 languages and 6 MT systems for correct translation alignments. • (KUCZMARSKI; JOHNSON, 2018) proposed techniques to produce translations in all genders of the target language. • (ZHAO et al., 2018; RUDINGER et al., 2018; WEBSTER et al., 2018) provided corpora for pronoun resolution and assessing gender bias.
  • 146. CHO ET AL. • Korean speakers • Provided a way to test MT systems for the Korean language. • Tested on 3 different MT systems. • Used multiple sentence templates per pair
  • 147. CHO ET AL. Figure: Cho et al. tested on different systems, including GT and Naver Papago (NP). Reuse of this image was kindly permitted by Cho et al.
  • 148. STANOVSKY, SMITH, ZETTLEMOYER • (STANOVSKY; SMITH; ZETTLEMOYER, 2019) built on previous studies of gender bias in coreference resolution (ZHAO et al., 2018; RUDINGER et al., 2018). • Tested on 6 different MT systems, 4 of them commercial. • Tested sentences using automatic tools, checking for gender alignment between the source and target sentences. • Also performed manual annotation for a small subset of 100 sentences with 2 native annotators.
  • 149. KUCZMARSKI, JOHNSON • Proposed techniques to produce translations in all genders of the target language. • In Summary: • Identify if a translation query may need gendered translation. • If so, translate the sentence forcing all possible genders in the target language. • Post-process to see if produced sentences are appropriate. • Present gendered tuple to user if so, otherwise translate as normal. • Similar to what GT seems to have adopted.
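The four summarized steps can be read as a small control flow. The sketch below is only that: the detector, translator and acceptability check are passed in as hypothetical callables, and the toy stand-ins are a lookup table, not a real MT system or the method's actual components.

```python
def gender_aware_translate(query, needs_gender, translate, is_acceptable,
                           genders=("feminine", "masculine")):
    """Return one translation per gender when appropriate, else a single default translation."""
    # Step 1: decide whether the query may need a gendered translation.
    if not needs_gender(query):
        return {"default": translate(query, None)}
    # Step 2: translate once per gender, forcing that gender in the target language.
    candidates = {g: translate(query, g) for g in genders}
    # Step 3: post-process to check that every forced translation is appropriate.
    if all(is_acceptable(query, text) for text in candidates.values()):
        # Step 4: present the gendered tuple to the user.
        return candidates
    # Otherwise translate as normal.
    return {"default": translate(query, None)}


# Toy stand-ins (hypothetical): a lookup table instead of a real MT system.
TOY_TABLE = {
    ("o bir doktor", "feminine"): "she is a doctor",
    ("o bir doktor", "masculine"): "he is a doctor",
    ("o bir doktor", None): "he is a doctor",
}


def toy_needs_gender(query):
    return query.split()[0] == "o"  # Turkish gender-neutral third-person pronoun


def toy_translate(query, gender):
    return TOY_TABLE.get((query, gender), query)


def toy_is_acceptable(query, text):
    return bool(text)


toy_result = gender_aware_translate("o bir doktor", toy_needs_gender, toy_translate, toy_is_acceptable)
```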
  • 150. GENDER BIAS CORPORA • (ZHAO et al., 2018; RUDINGER et al., 2018; WEBSTER et al., 2018) provided corpora for gendered pronoun resolution. • These can be used to benchmark MT tools. • They also identified biases and called attention to them.
  • 151. FUTURE WORK • Future Work: • We are not aware of a study similar to (CHO et al., 2019) for the Persian or Nepali languages. • Cho et al. are looking to expand their work to multiple languages. • We are expanding some of our experiments on bias in MT. • We are - very - open to collaboration and suggestions.
  • 152. THANK YOU Contacts: morprates@inf.ufrgs.br phcavelar@inf.ufrgs.br lamb@inf.ufrgs.br
  • 153. BIBLIOGRAPHY I ANGWIN, J. et al. Machine bias: There’s software used across the country to predict future criminals and it’s biased against blacks. 2016. Last visited 2017-12-17. Available at: <https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing>. BOLUKBASI, T. et al. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In: NIPS. [S.l.: s.n.], 2016. p. 4349–4357. Bureau of Labor Statistics. "Table 11: Employed persons by detailed occupation, sex, race, and Hispanic or Latino ethnicity, 2017". [S.l.], 2017.
  • 154. BIBLIOGRAPHY II CHO, W. I. et al. On measuring gender bias in translation of gender-neutral pronouns. In: Proceedings of the First Workshop on Gender Bias in Natural Language Processing. Florence, Italy: Association for Computational Linguistics, 2019. p. 173–181. Available at: <https://www.aclweb.org/anthology/W19-3824>. GARCIA, M. Racist in the machine: The disturbing implications of algorithmic bias. World Policy Journal, Duke Univ Press, v. 33, n. 4, p. 111–117, 2016. KUCZMARSKI, J.; JOHNSON, M. Gender-aware natural language translation. 2018. MILLS, K.-A. ’Racist’ soap dispenser refuses to help dark-skinned man wash his hands - but Twitter blames ’technology’. 2017. Last visited 2017-12-17. Available at: <http://www.mirror.co.uk/news/world-news/racist-soap-dispenser-refuses-help-11004385>.
  • 155. BIBLIOGRAPHY III PAPENFUSS, M. Woman In China Says Colleague’s Face Was Able To Unlock Her iPhone X. 2017. Last visited 2017-12-17. Available at: <http://www.huffpostbrasil.com/entry/iphone-face-recognition-double_us_5a332cbce4b0ff955ad17d50>. PRATES, M. O. R.; AVELAR, P. H.; LAMB, L. C. Assessing gender bias in machine translation: a case study with Google Translate. Neural Computing and Applications, Mar 2019. ISSN 1433-3058. Available at: <https://doi.org/10.1007/s00521-019-04144-6>. PRATES, M. O. R.; AVELAR, P. H. C.; LAMB, L. C. On quantifying and understanding the role of ethics in AI research: A historical account of flagship conferences and journals. In: GCAI. [S.l.]: EasyChair, 2018. (EPiC Series in Computing, v. 55), p. 188–201. RUDINGER, R. et al. Gender bias in coreference resolution. In: NAACL-HLT (2). [S.l.]: Association for Computational Linguistics, 2018. p. 8–14.
  • 156. BIBLIOGRAPHY IV SCHIEBINGER, L. Scientific research must take gender into account. Nature, Nature Publishing Group, v. 507, n. 7490, p. 9, 2014. STANOVSKY, G.; SMITH, N. A.; ZETTLEMOYER, L. Evaluating gender bias in machine translation. In: ACL (1). [S.l.]: Association for Computational Linguistics, 2019. p. 1679–1684. WEBSTER, K. et al. Mind the gap: A balanced corpus of gendered ambiguous pronouns. In: Transactions of the ACL. [S.l.: s.n.], 2018. p. to appear. ZHAO, J. et al. Gender bias in coreference resolution: Evaluation and debiasing methods. In: NAACL-HLT (2). [S.l.]: Association for Computational Linguistics, 2018. p. 15–20.