Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification
Timothy Baldwin
Grammarly, 13/9/2017
Talk Outline
1 Introduction
Bias
Language Identification
2 Take 1: Starting out Naively
3 Take 2: Removing the Domain Bias
4 Take 3: Removing the Monolingual Bias
5 Take 4: Removing Social Bias
6 Conclusions
What is Bias?
Lots of different biases in NLP/ML, but the biases of particular interest in this talk are:
domain bias = systematic statistical differences between “domains”, usually in the form of vocab, class priors and conditional probabilities
dataset bias = sampling bias in a given dataset, giving rise to (often unintentional) content bias
model bias = bias in the output of a model wrt the gold standard
social bias = model bias towards/against particular socio-demographic groups of users
Domain Bias: Why should we Care?
Domain robustness: we want to avoid the “Penn Treebank effect” of over-specialising to a particular domain and not being able to achieve the same performance levels over other input domains
domain adaptation = sub-area of NLP/ML concerned with explicitly learning distribution differences between the “source” and “target” domains, and using these to modify the parameters of a model trained on the source domain to better fit the data in the target domain [Daumé III, 2007, Ben-David et al., 2010]
generally based on an assumption of access to data in the target domain, usually in the form of a small labelled set of data
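As a concrete illustration (not part of the original slides), here is a minimal sketch of the “frustratingly easy” feature-augmentation idea of Daumé III [2007], assuming simple dict-based feature vectors; the function name and representation are hypothetical.

```python
def augment(features, domain):
    """Daumé III (2007)-style feature augmentation (sketch).

    Each input feature f with value v is copied into a shared version
    ("general:f") and a domain-specific version ("<domain>:f"), so a
    downstream linear model can learn both domain-general and
    domain-specific weights for the same feature.
    """
    out = {}
    for f, v in features.items():
        out["general:" + f] = v       # shared across all domains
        out[domain + ":" + f] = v     # active only for this domain
    return out

# Example: the same feature fires in a shared slot and a domain-specific slot.
print(augment({"suffix=ing": 1}, "news"))
# {'general:suffix=ing': 1, 'news:suffix=ing': 1}
```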
Dataset Bias: Why should we Care?
Representativeness: datasets are core to NLP/ML, in terms of setting challenges and benchmarking progress, but if those datasets aren’t representative of the true population, any success is hollow at best!
our datasets are considerably more discriminative than naively thought [Torralba and Efros, 2011, Tommasi et al., 2015]
our datasets are socially biased in all sorts of ways [Zhao et al., 2017]
Impact on model bias: dataset bias tends to be mirrored/amplified by the model [Zhao et al., 2017]
Social Bias: Why should we Care?
Social equity: do we really want to be in the business of building models that only cater to certain sub-populations, or that overtly discriminate against certain sub-populations? [Hovy and Søgaard, 2015, Hovy and Spruit, 2016]
Legal adherence: for NLP/ML to be applied in many domains by public and private organisations, it has to be demonstrably compliant with anti-discrimination laws
Simple approaches such as “fairness through unawareness” and “demographic parity” do not solve the problem [Hardt et al., 2016]
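As a rough illustration of why demographic parity alone is a weak criterion, here is a minimal sketch (not from the talk) that computes per-group positive-prediction rates and per-group true-positive rates; the data and group labels are made up.

```python
from collections import defaultdict

def group_rates(y_true, y_pred, groups):
    """Per-group positive-prediction rate (the quantity behind "demographic
    parity") and per-group true-positive rate (one ingredient of the
    equality-of-opportunity criterion of Hardt et al. [2016])."""
    stats = defaultdict(lambda: {"n": 0, "pos_pred": 0, "n_pos": 0, "tp": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        s = stats[g]
        s["n"] += 1
        s["pos_pred"] += p
        if t == 1:
            s["n_pos"] += 1
            s["tp"] += p
    return {g: {"pred_pos_rate": s["pos_pred"] / s["n"],
                "tpr": s["tp"] / s["n_pos"] if s["n_pos"] else None}
            for g, s in stats.items()}

# Toy example: both groups receive positive predictions at the same rate
# (demographic parity holds), yet every true positive in group B is missed.
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(group_rates(y_true, y_pred, groups))
```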
What is Language Identification (“LangID”)?
自然語言處理是人工智慧和語言學領域的分支學科。在這此領域中探討如何處理及運用自然語言;自然語言認知則是指讓電腦「懂」人類的語言。自然語言生成系統把計算機數據轉化為自然語言。自然語言理解系統把自然語言轉化為計算機程序更易于處理的形式。

Natural Language processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages. In theory, natural-language processing is a very attractive method of human-computer interaction.

L'Elaborazione del linguaggio naturale, detta anche NLP (dall'inglese Natural Language Processing), è il processo di estrazione di informazioni semantiche da espressioni del linguaggio umano o naturale, scritte o parlate, tramite l'elaborazione di un calcolatore elettronico.

自然言語処理は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、人工知能と言語学の一分野である。計算言語学も同じ意味であるが、前者は工学的な視点からの言語処理をさすのに対して、後者は言語学的視点を重視する手法をさす事が多い。データベース内の情報を自然言語に変換したり、自然言語の文章をより形式的な(コンピュータが理解しやすい)表現に変換するといった処理が含まれる。
How Hard can it be?
What is the language of the following document:
Så sitter du åter på handlar’ns trapp och gråter så övergivet.
Swedish
Getting Wilder ...
What is the language of the following document:
Revolution is à la mode at the moment in the country, where the joie de vivre of the citizens was once again plunged into chaos after a third coup d’état in as many years. Although the leading general is by no means an enfant terrible per se, the fledgling economy still stands to be jettisoned down la poubelle
English
... And Wilder Still ...
What is the language of the following document:
Jawaranya ngu yidanyi ngaba ngu yardi yaniya cool drink ninaka nanga alangi-nka.
Wambaya
... And the Actual Wildness of the Problem
What is the language of the following document:
111000111000001010000100111000111000000110111011111000111000001110111100
Too hard? A hint ...
11100011 10000010 10000100 11100011 10000001 10111011 11100011 10000011 10111100
Japanese
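The hint groups the bits into bytes, which decode as UTF-8 into Japanese kana. A quick sketch of the decoding (this simply interprets the hinted byte sequence):

```python
# The nine hinted bytes, as bit strings.
bits = ("11100011 10000010 10000100 11100011 10000001 "
        "10111011 11100011 10000011 10111100").split()
data = bytes(int(b, 2) for b in bits)
print(data.decode("utf-8"))   # three Japanese kana characters
```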
Take 1: Starting out Naively
Preliminary LangID Setting
To get going, let’s assume that:
1 documents are all from a homogeneous source = closed-world domain assumption
2 all documents are monolingual = single label assumption
3 languages are homogeneous = single variety assumption
Here and throughout, we’ll model the task as a supervised classification problem
As a starting point, let’s play around with simple multinomial naive Bayes over byte n-grams (see the sketch below)
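A minimal sketch of this kind of baseline (not the exact experimental setup from the talk): multinomial naive Bayes over byte n-gram counts, here using scikit-learn with a custom byte n-gram analyzer; the toy training data is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def byte_ngrams(text, n_max=4):
    """Represent a document as its overlapping byte 1- to n_max-grams
    (bytes are mapped back to one-char-per-byte strings for convenience)."""
    b = text.encode("utf-8")
    return [b[i:i + n].decode("latin-1")
            for n in range(1, n_max + 1)
            for i in range(len(b) - n + 1)]

# Toy training set: (document, language) pairs.
docs = ["the cat sat on the mat", "le chat est sur le tapis",
        "die katze sitzt auf der matte", "el gato duerme en la alfombra"]
langs = ["en", "fr", "de", "es"]

model = make_pipeline(CountVectorizer(analyzer=byte_ngrams), MultinomialNB())
model.fit(docs, langs)
print(model.predict(["der hund sitzt auf dem sofa"]))   # hopefully ['de']
```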
Machine Learning in LangID
Supervised classification task
Model of labeled training data used to label “unseen” data
Similar to text classification (“TC”)
vector space models, naive Bayes models, logistic regression
Text representation:
TC: bag of words
LangID: bag of byte-sequences
Source(s): Hughes et al. [2006]
Text Representation for LangID
Input: language identification
1-grams: l, a, n, g, u, a ...
2-grams: la, an, ng, gu, ua, ag ...
3-grams: lan, ang, ngu, gua, uag, age ...
4-grams: lang, angu, ngua, guag, uage, ...
...
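The listing above can be reproduced with a couple of lines (a toy sketch; for ASCII input, character n-grams and byte n-grams coincide):

```python
def char_ngrams(text, n):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

for n in range(1, 5):
    print(f"{n}-grams:", char_ngrams("language identification", n)[:6], "...")
```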
Datasets for Preliminary Experiments
Dataset Docs Langs Doc Length (bytes)
JRC-Acquis 20000 22 18478.5±60836.8
Debian 21735 89 12329.8±30902.7
RCV2 20000 13 3382.7±1671.8
Wikipedia 20000 68 7531.3±16522.2
Source(s): Steinberger et al. [2006], Lui and Baldwin [2011]
In-domain vs. Cross-domain Results
[Bar charts: classification accuracy (%) per training dataset, testing in-domain vs. cross-domain]
Dataset       In-domain   Cross-domain
JRC-Acquis       99            93
Debian           96            82
RCV2             97            58
Wikipedia        94            74
What’s Happening? Transfer Learning
Finding
In-domain, supervised language identification (over longish, monolingual
documents) is relatively trivial for even basic learning algorithms, but
overfitting tends to be chronic
Source(s): Lui and Baldwin [2011, 2012]
Take 2: Removing the Domain Bias
Cross-domain Feature Selection
Goal: identify n-grams with:
high language association
low domain association
Measure relationship using information gain (= difference in entropy after
partitioning the data in a given way)
Language- vs. Domain-based IG
[Scatter plot: per-feature information gain with respect to language (x-axis) vs. information gain with respect to domain (y-axis), with point density shown on a log10(N) colour scale]
What does this Mean?
There are two distinct groups of features: (1) features whose IG for language is strongly correlated with their IG for domain; and (2) features whose IG for language is largely independent of their IG for domain
the second of these is what we are interested in
Automatically detecting language- (and not domain-) associated features:
LD^all(t) = IG^all_lang(t) − IG_domain(t)
Also consider a variant which is debiased with respect to the volume of training data per language:
LD^bin(t|l) = IG^bin_lang(t|l) − IG_domain(t)
Source(s): Lui and Baldwin [2011]
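A minimal sketch of the LD idea (my own simplification, not the released implementation): compute the information gain of a binary feature with respect to language labels and with respect to domain labels, and score the feature by the difference.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(has_feature, labels):
    """IG of a binary feature (present/absent per document) w.r.t. labels."""
    h = entropy(labels)
    for value in (True, False):
        subset = [l for f, l in zip(has_feature, labels) if f == value]
        if subset:
            h -= len(subset) / len(labels) * entropy(subset)
    return h

def ld_score(has_feature, langs, domains):
    """LD(t) = IG_lang(t) - IG_domain(t): high for features that signal
    the language but carry little information about the domain."""
    return info_gain(has_feature, langs) - info_gain(has_feature, domains)

# Toy example: the feature fires for all (and only) the 'en' documents,
# regardless of domain, so it receives a high LD score.
langs   = ["en", "en", "de", "de", "en", "de"]
domains = ["news", "wiki", "news", "wiki", "news", "wiki"]
feature = [True, True, False, False, True, False]
print(round(ld_score(feature, langs, domains), 3))
```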
Computational Bottleneck
Calculating the IG for low-order n-grams is fine, but it quickly becomes
intractable for larger values of n
Ideally, we want a method which scales to (very) large numbers of
features/high n-gram orders, with little computational overhead
DF vs. LD
[Scatter plot: document frequency (x-axis) vs. LangDomain (LD) score (y-axis), with point density shown on a log10(N) colour scale]
Observation
Low DF is a good predictor of low LD (but not vice versa)
As DF is much cheaper to compute, we first identify the 15000 features with highest DF for a given n-gram order, and assign an LD score of 0 to all features outside this set
The final feature representation is a combination of the top-N features for each of a predefined set of n-gram orders
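A sketch of the DF-based short-cut (simplified; the 15000 threshold is the one quoted above, everything else is illustrative): rank n-grams by document frequency, keep the top 15000, and only compute LD for those.

```python
from collections import Counter

def top_df_features(docs_as_ngram_sets, k=15000):
    """Document frequency = number of documents a feature occurs in.
    Only the k features with the highest DF are passed on to the (more
    expensive) LD computation; all other features implicitly get LD = 0."""
    df = Counter()
    for ngrams in docs_as_ngram_sets:
        df.update(set(ngrams))          # count each feature once per document
    return [feat for feat, _ in df.most_common(k)]

# Usage sketch: each element would be the set of byte n-grams of one document.
docs = [{"th", "the", "he"}, {"th", "the"}, {"la", "le"}]
print(top_df_features(docs, k=2))
```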
Other Details
Byte-based tokenisation
Multinomial naive Bayes learner
1–9-grams (although little gained in going beyond 4-grams)
Four training/test datasets, plus three test-only datasets (N-EuroGOV,
N-TCL and Wikipedia)
Source(s): Baldwin and Lui [2010], Lui and Baldwin [2011, 2012]
In-domain vs. Cross-domain Results
[Bar chart: accuracy (%) over JRC-Acquis, Debian, RCV2 and Wikipedia for the in-domain baseline and for cross-domain classification with IG^bin_lang, LD^all and LD^bin feature selection; LD-based feature selection brings cross-domain accuracy back close to in-domain levels]
Per-language Accuracy as |L| Increases
[Scatter plot: per-language accuracy with domain languages only (x-axis) vs. all languages (y-axis), for the TCL, Wikipedia and EuroGOV test sets]
Findings
Assuming we have a reasonable number of different domains, LD^bin is an effective approach to feature selection, and when paired with DF ranking, can be scaled to very, very large feature sets and large numbers of languages
More training languages tends not to hurt performance, and in some instances actually helps
Reference implementation:
https://github.com/saffsd/langid.py
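For reference, basic usage of the released tool looks roughly like this (assuming `pip install langid`; the scores in the output will vary):

```python
import langid

# classify() returns a (language code, score) pair for the most likely language.
print(langid.classify("This is a short English sentence."))       # e.g. ('en', ...)

# The candidate language set can optionally be constrained.
langid.set_languages(["en", "sv", "de"])
print(langid.classify("Så sitter du åter på handlar’ns trapp."))  # e.g. ('sv', ...)
```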
Take 3: Removing the Monolingual Bias
What about Multilingual Documents?
Reasons for documents being multilingual:
code-switching
translations
boilerplate/interface language
multiple users
Why detect multilingual documents?
pre-filtering for monolingual NLP
mining bilingual texts for MT
low-density languages on the web
Task setting in this section:
estimate the relative language proportions for a given mono/multi-lingual
document
Source(s): Lui et al. [2014]
Generative Mixture Models
P(w_i) = Σ_{j=1}^{T} P(w_i | z_i = j) · P(z_i = j)
P(w_i): probability of the i-th token
P(w_i | z_i = j): probability of w_i given label j
P(z_i = j): probability of the i-th label being j
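In code, the per-token mixture probability is just a weighted sum over the candidate labels (a toy sketch with made-up numbers):

```python
# P(w_i) = sum_j P(w_i | z_i = j) * P(z_i = j)
p_token_given_lang = {"en": 0.020, "fr": 0.001}   # P(w_i | z_i = j), e.g. for the token "the"
p_lang = {"en": 0.7, "fr": 0.3}                   # P(z_i = j): the document's language mix

p_token = sum(p_token_given_lang[j] * p_lang[j] for j in p_lang)
print(p_token)   # 0.020*0.7 + 0.001*0.3 = 0.0143
```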
Latent Dirichlet Allocation (LDA)
Source(s): Blei et al. [2003]
Gibbs Sampling: LDA vs. Multilingual LangID
                                    LDA                          Multilingual LangID
labels z                            topics                       language
tokens w                            words                        language-indicative byte sequences
token–label distribution φ          infer using Gibbs sampler    estimate using training data
label–document distribution θ       infer using Gibbs sampler    infer using Gibbs sampler
How many Languages?
Use Gibbs sampling to infer the most likely mixture of languages over a fixed
set:
L is the set of languages in the training data
find the most likely subset of languages λ⊂L
Source(s): Lui et al. [2014]
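A minimal sketch of this kind of sampler (my own simplification, not the released code): the per-language token distributions φ are held fixed from training data, and a Gibbs sampler re-assigns each token to a language and re-estimates the document's language proportions θ.

```python
import random
from collections import Counter

def sample_language_mix(tokens, phi, langs, iters=200, alpha=0.1, seed=0):
    """Gibbs-sample a per-token language label z_i, holding the per-language
    token distributions phi fixed; return the estimated language proportions
    theta for the document (alpha is a symmetric Dirichlet prior)."""
    rng = random.Random(seed)
    z = [rng.choice(langs) for _ in tokens]            # random initial labels
    counts = Counter(z)
    for _ in range(iters):
        for i, w in enumerate(tokens):
            counts[z[i]] -= 1                           # hold out token i
            weights = [(counts[l] + alpha) * phi[l].get(w, 1e-9) for l in langs]
            r, acc = rng.random() * sum(weights), 0.0
            for l, wt in zip(langs, weights):
                acc += wt
                if r <= acc:
                    z[i] = l
                    break
            counts[z[i]] += 1
    return {l: counts[l] / len(tokens) for l in langs}

# Toy example with two languages and word-level tokens standing in for byte sequences.
phi = {"en": {"the": 0.05, "cat": 0.01, "chat": 1e-6, "le": 1e-6},
       "fr": {"the": 1e-6, "cat": 1e-6, "chat": 0.01, "le": 0.05}}
doc = ["the", "cat", "the", "le", "chat", "le"]
print(sample_language_mix(doc, phi, ["en", "fr"]))      # roughly half en, half fr
```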
WikipediaMulti dataset
Актор народився Німеччини. Мати Кларка
страждала від епілепсії і померла через десять
місяців після його народження, тому хлопчика
виростили батько і мачуха, Джені Данлап. У віці
17 років Гейбл зацікавився театром. Але тільки
у віці 21 р. він зміг почати працювати у
другорозрядних театрах. У Портланді, Орегон
Гейбл познайомився з театральним
менеджером Джозефіною Діллон, яка дуже
допомогла Гейблу у навчанні театральному
мистецтву.Türkiye, Kafkasya ve Orta Asya
bölgelerini kapsayan akademik ve hakemli dergisi.
Disiplinlerarası bir yayın politikası belirleyen
OAKA’ya bilimsel ve orijinal makaleler
gönderilebiliyor. Yılda 2sayı yayınlanan dergi
Bahar ve Güz dönemlerinde çıkıyor. Dergi Türkçe,
Türkçe lehçeleri ve İngilizce makaleleri kabul
ediyor. Ayrıca başka dillerden çeviri de
gönderilebiliyor.Машина на Тюринг е абстрактно
изчислително устройство, описано от
английския математик Алън Тюринг през 1936
г. Тюринг използва машината, за да даде
първото точно определение на понятието
компютърните науки, най-вече в областите
изчислимост и сложност на алгоритмите, както
и в математическата логика. Имената на
инструкциите са големите латински букви A, B
и C. С малка буква s сме означили специалната
инструкция за спиране на изчислението.
synthetic dataset
mixture of monolingual and
multilingual documents
K languages per document
(1≤K≤5)
samples lines from Wikipedia
train: 5000 monolingual documents
dev: 1000 documents for each K
test: 200 documents for each K
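Construction of such a synthetic document can be sketched as follows (illustrative only; the real dataset samples lines from Wikipedia dumps):

```python
import random

def make_multilingual_doc(mono_corpora, k, lines_per_lang=3, seed=0):
    """Concatenate a few lines from each of k randomly chosen languages,
    and record the gold language set for evaluation."""
    rng = random.Random(seed)
    langs = rng.sample(sorted(mono_corpora), k)
    text = "\n".join(line for l in langs
                     for line in rng.sample(mono_corpora[l], lines_per_lang))
    return text, set(langs)

# Toy monolingual "corpora" (in practice: lines sampled from Wikipedia).
corpora = {"en": ["the cat sat", "a dog barked", "rain fell", "birds sang"],
           "de": ["die katze saß", "ein hund bellte", "regen fiel", "vögel sangen"],
           "fr": ["le chat dormait", "un chien aboyait", "il pleuvait", "les oiseaux chantaient"]}
doc, gold = make_multilingual_doc(corpora, k=2)
print(gold)
```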
WikipediaMulti results
[Bar chart: precision, recall and F-score of Our Method vs. Linguini and SegLang on WikipediaMulti; our method outperforms both benchmarks]
Findings
Extension of langid.py to include prediction of the language mix in a
document (and a Gibbs sampler to predict the language proportions), which
outperforms benchmarks on synthetic and real-world data
Reference implementation:
https://github.com/saffsd/polyglot
Source(s): Lui et al. [2014]
Take 4: Removing Social Bias
Are all Varieties of the Same Language Equal?
All good so far, but we are still assuming that languages are homogeneous; is this really the case?
Explore the question by constructing more diverse test datasets from Twitter
across the dimensions of:
1 geographic diversity
2 multilingual diversity
Also generate more socially diverse training datasets
Source(s): Jurgens et al. [2017]
EquiLID: Socially-equitable LangID
Attempt to make LangID more socially equitable, with particular focus on
Twitter
Model details:
character-based tokenisation
encoder-decoder architecture (3 layers of 512 GRUs) with attention
trained over 3 epochs using adadelta
Training setup:
train on geographic + social/topically-diverse (+ multilingual) data
balance training data across languages
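The “balance training data across languages” step can be as simple as capping the number of examples per language (a sketch, not the EquiLID training pipeline):

```python
import random
from collections import defaultdict

def balance_by_language(examples, per_lang=10000, seed=0):
    """Downsample (text, lang) pairs so that no language dominates training."""
    rng = random.Random(seed)
    by_lang = defaultdict(list)
    for text, lang in examples:
        by_lang[lang].append(text)
    balanced = []
    for lang, texts in by_lang.items():
        rng.shuffle(texts)
        balanced.extend((t, lang) for t in texts[:per_lang])
    rng.shuffle(balanced)
    return balanced

# Toy example: 5 English tweets vs. 2 Spanish tweets, capped at 2 per language.
examples = [("hi there", "en")] * 5 + [("hola", "es")] * 2
print(len(balance_by_language(examples, per_lang=2)))   # 4
```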
Results (Monolingual)
F-score (%)     Geo (µ)   Geo (M)   Twitter70 (µ)   Twitter70 (M)
langid.py         96        23          77              38
Chrome CLD        93        22          74              50
EquiLID           98        60          91              92
Source(s): Jurgens et al. [2017]
LangID Recall in Application Scenario
Source(s): Jurgens et al. [2017]
Reflections
Explicitly training models on diverse, balanced data is critically important in
order to achieve “social equity”
Need to tease apart training scenario from modelling details (langid.py not
retrained, e.g.)
Possible extra gains to be had through explicit language-, geographical- and
social-level regularisation during training
Deep learning model clearly more accurate, but efficiency a concern (but see
Botha et al. [2017], e.g.)
Conclusions
Domain and social bias are prevalent in standard datasets and tools
Domain bias is well known, but addressing it tends to rely on explicit domain adaptation (not always realistic); true “generalisation” is harder
Social bias is an emerging area of NLP that is much less understood, but incredibly important as NLP reaches more and more people in more and more (social) contexts
It is possible to reduce the bias through explicit training regimes
Acknowledgements
The following collaborators contributed to this work: Steven Bird, Baden
Hughes, Jey Han Lau, Marco Lui, Andrew MacKinlay and Jeremy Nicholson
This work was supported by the Australian Research Council and NICTA
References I
Timothy Baldwin and Marco Lui. Language identification: The long and the short of the matter. In Proceedings of
Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association
for Computational Linguistics (NAACL HLT 2010), pages 229–237, Los Angeles, USA, 2010.
Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A
theory of learning from different domains. Machine learning, 79(1):151–175, 2010.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning
Research, 3:993–1022, 2003.
Jan A. Botha, Emily Pitler, Ji Ma, Anton Bakalov, Alex Salcianu, David Weiss, Ryan McDonald, and Slav Petrov.
Natural language processing with small feed-forward networks. In Proceedings of the 2017 Conference on
Empirical Methods in Natural Language Processing (EMNLP 2017), pages 2869–2875, 2017.
Hal Daumé III. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), Prague, Czech Republic, 2007.
Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems 29, pages 3315–3323, 2016. URL http://papers.nips.cc/paper/6374-equality-of-opportunity-in-supervised-learning.pdf.
References II
Dirk Hovy and Anders Søgaard. Tagging performance correlates with author age. In Proceedings of the Joint
conference of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th
International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language
Processing (ACL-IJCNLP 2015), pages 483–488, 2015. URL http://aclweb.org/anthology/P15-2079.
Dirk Hovy and Shannon L. Spruit. The social impact of natural language processing. In Proceedings of the 54th
Annual Meeting of the Association for Computational Linguistics (ACL 2016), pages 591–598, 2016.
Baden Hughes, Timothy Baldwin, Steven Bird, Jeremy Nicholson, and Andrew MacKinlay. Reconsidering language
identification for written language resources. In Proceedings of the 5th International Conference on Language
Resources and Evaluation (LREC 2006), pages 485–488, Genoa, Italy, 2006.
David Jurgens, Yulia Tsvetkov, and Dan Jurafsky. Incorporating dialectal variability for socially equitable language
identification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics
(ACL 2017), Volume 2: Short Papers, pages 51–57, 2017.
Marco Lui and Timothy Baldwin. Cross-domain feature selection for language identification. In Proceedings of the
5th International Joint Conference on Natural Language Processing (IJCNLP 2011), pages 553–561, Chiang
Mai, Thailand, 2011.
Marco Lui and Timothy Baldwin. langid.py: An off-the-shelf language identification tool. In Proceedings of the
50th Annual Meeting of the Association for Computational Linguistics (ACL 2012) Demo Session, pages 25–30,
Jeju, Republic of Korea, 2012.
References III
Marco Lui, Jey Han Lau, and Timothy Baldwin. Automatic detection and language identification of multilingual
documents. Transactions of the Association for Computational Linguistics, 2(Feb):27–40, 2014.
Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufis, and Dániel Varga. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, 2006.
Tatiana Tommasi, Novi Patricia, Barbara Caputo, and Tinne Tuytelaars. A Deeper Look at Dataset Bias. ArXiv
e-prints, 2015.
Antonio Torralba and Alexei A. Efros. Unbiased look at dataset bias. In Proceedings of the 2011 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR 2011), pages 1521–1528, 2011.
Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping: Reducing
gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical
Methods in Natural Language Processing (EMNLP 2017), pages 2941–2951, 2017.
More Related Content

What's hot (6)

Louise Bowen
Louise BowenLouise Bowen
Louise Bowen
 
Eurocall 2011, Nottingham
Eurocall 2011, NottinghamEurocall 2011, Nottingham
Eurocall 2011, Nottingham
 
Jane Andrews
Jane AndrewsJane Andrews
Jane Andrews
 
Understanding the Identity of a Minority ELL in Singapore
Understanding the Identity of a Minority ELL in SingaporeUnderstanding the Identity of a Minority ELL in Singapore
Understanding the Identity of a Minority ELL in Singapore
 
Kritik english as debate language
Kritik english as debate languageKritik english as debate language
Kritik english as debate language
 
Thorne_Iowa_Pusack 2012
Thorne_Iowa_Pusack 2012Thorne_Iowa_Pusack 2012
Thorne_Iowa_Pusack 2012
 

Similar to Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Language Identification - Tim Baldwin

The Input Learner Learners Forward Throughout...
The Input Learner Learners Forward Throughout...The Input Learner Learners Forward Throughout...
The Input Learner Learners Forward Throughout...
Tiffany Sandoval
 
Deep misconceptions and the myth of data driven NLU
Deep misconceptions and the myth of data driven NLUDeep misconceptions and the myth of data driven NLU
Deep misconceptions and the myth of data driven NLU
Walid Saba
 
Linguistic And Social Inequality
Linguistic And Social InequalityLinguistic And Social Inequality
Linguistic And Social Inequality
Dr. Cupid Lucid
 
Running head UNIT VII PROJECT 1UNIT VII PROJECT 6Unit VII Pro.docx
Running head UNIT VII PROJECT 1UNIT VII PROJECT 6Unit VII Pro.docxRunning head UNIT VII PROJECT 1UNIT VII PROJECT 6Unit VII Pro.docx
Running head UNIT VII PROJECT 1UNIT VII PROJECT 6Unit VII Pro.docx
toltonkendal
 

Similar to Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Language Identification - Tim Baldwin (20)

6Th Grade Persuasive Essay Topics. 003 6th Grade Argumentative Essay Topics E...
6Th Grade Persuasive Essay Topics. 003 6th Grade Argumentative Essay Topics E...6Th Grade Persuasive Essay Topics. 003 6th Grade Argumentative Essay Topics E...
6Th Grade Persuasive Essay Topics. 003 6th Grade Argumentative Essay Topics E...
 
Are children with_specific_language_impairment_competent_with_the_pragmatics_...
Are children with_specific_language_impairment_competent_with_the_pragmatics_...Are children with_specific_language_impairment_competent_with_the_pragmatics_...
Are children with_specific_language_impairment_competent_with_the_pragmatics_...
 
The Input Learner Learners Forward Throughout...
The Input Learner Learners Forward Throughout...The Input Learner Learners Forward Throughout...
The Input Learner Learners Forward Throughout...
 
Deep misconceptions and the myth of data driven NLU
Deep misconceptions and the myth of data driven NLUDeep misconceptions and the myth of data driven NLU
Deep misconceptions and the myth of data driven NLU
 
Sociolinguistics (Paper)
Sociolinguistics (Paper)Sociolinguistics (Paper)
Sociolinguistics (Paper)
 
Linguistic And Social Inequality
Linguistic And Social InequalityLinguistic And Social Inequality
Linguistic And Social Inequality
 
Contextual Influence Factors on Educational Scenarios. Due-Publico, Essen. (R...
Contextual Influence Factors on Educational Scenarios. Due-Publico, Essen. (R...Contextual Influence Factors on Educational Scenarios. Due-Publico, Essen. (R...
Contextual Influence Factors on Educational Scenarios. Due-Publico, Essen. (R...
 
LANE422ch2.ppt
LANE422ch2.pptLANE422ch2.ppt
LANE422ch2.ppt
 
Plain language and artificial intelligence
Plain language and artificial intelligencePlain language and artificial intelligence
Plain language and artificial intelligence
 
Second and foreign language data
Second and foreign language dataSecond and foreign language data
Second and foreign language data
 
08 09.4.what is-discourse-2
08 09.4.what is-discourse-208 09.4.what is-discourse-2
08 09.4.what is-discourse-2
 
Sociolinguistics, ch 9
Sociolinguistics, ch 9Sociolinguistics, ch 9
Sociolinguistics, ch 9
 
Sign Language in Primary Education
Sign Language in Primary EducationSign Language in Primary Education
Sign Language in Primary Education
 
English as a lingua franca in academia
English as a lingua franca in academiaEnglish as a lingua franca in academia
English as a lingua franca in academia
 
Uk Essays.pdf
Uk Essays.pdfUk Essays.pdf
Uk Essays.pdf
 
Interlanguage Analysis of Spanish Learners
Interlanguage Analysis of Spanish LearnersInterlanguage Analysis of Spanish Learners
Interlanguage Analysis of Spanish Learners
 
Running head UNIT VII PROJECT 1UNIT VII PROJECT 6Unit VII Pro.docx
Running head UNIT VII PROJECT 1UNIT VII PROJECT 6Unit VII Pro.docxRunning head UNIT VII PROJECT 1UNIT VII PROJECT 6Unit VII Pro.docx
Running head UNIT VII PROJECT 1UNIT VII PROJECT 6Unit VII Pro.docx
 
Pragmatic language impairment in relation to autism and SLI
Pragmatic language impairment in relation to autism and SLIPragmatic language impairment in relation to autism and SLI
Pragmatic language impairment in relation to autism and SLI
 
Chapter 10 toward a theory of second language acquisition
Chapter 10  toward a theory of second language acquisitionChapter 10  toward a theory of second language acquisition
Chapter 10 toward a theory of second language acquisition
 
Why 2017 is the Year to Learn a New Language
Why 2017 is the Year to Learn a New LanguageWhy 2017 is the Year to Learn a New Language
Why 2017 is the Year to Learn a New Language
 

More from Grammarly

Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...
Grammarly
 
Natural Language Processing for biomedical text mining - Thierry Hamon
Natural Language Processing for biomedical text mining - Thierry HamonNatural Language Processing for biomedical text mining - Thierry Hamon
Natural Language Processing for biomedical text mining - Thierry Hamon
Grammarly
 

More from Grammarly (14)

Vitalii Braslavskyi - Declarative engineering
Vitalii Braslavskyi - Declarative engineering Vitalii Braslavskyi - Declarative engineering
Vitalii Braslavskyi - Declarative engineering
 
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...
Grammarly AI-NLP Club #10 - Information-Theoretic Probing with Minimum Descri...
 
Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien...
Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien...Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien...
Grammarly AI-NLP Club #9 - Dumpster diving for parallel corpora with efficien...
 
Grammarly AI-NLP Club #8 - Arabic Natural Language Processing: Challenges and...
Grammarly AI-NLP Club #8 - Arabic Natural Language Processing: Challenges and...Grammarly AI-NLP Club #8 - Arabic Natural Language Processing: Challenges and...
Grammarly AI-NLP Club #8 - Arabic Natural Language Processing: Challenges and...
 
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Che...
 
Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...
Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...
Grammarly AI-NLP Club #5 - Automatic text simplification in the biomedical do...
 
Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...
Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...
Grammarly AI-NLP Club #3 - Learning to Read for Automated Fact Checking - Isa...
 
Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...
Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...
Grammarly AI-NLP Club #4 - Understanding and assessing language with neural n...
 
Grammarly Meetup: DevOps at Grammarly: Scaling 100x
Grammarly Meetup: DevOps at Grammarly: Scaling 100xGrammarly Meetup: DevOps at Grammarly: Scaling 100x
Grammarly Meetup: DevOps at Grammarly: Scaling 100x
 
Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...
Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...
Grammarly Meetup: Memory Networks for Question Answering on Tabular Data - Sv...
 
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...
Grammarly AI-NLP Club #2 - Recent advances in applied chatbot technology - Jo...
 
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy GryshchukGrammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
Grammarly Meetup: Paraphrase Detection in NLP (PART 2) - Andriy Gryshchuk
 
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy GutsGrammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
Grammarly Meetup: Paraphrase Detection in NLP (PART 1) - Yuriy Guts
 
Natural Language Processing for biomedical text mining - Thierry Hamon
Natural Language Processing for biomedical text mining - Thierry HamonNatural Language Processing for biomedical text mining - Thierry Hamon
Natural Language Processing for biomedical text mining - Thierry Hamon
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Grammarly AI-NLP Club #1 - Domain and Social Bias in NLP: Case Study in Language Identification - Tim Baldwin

  • 1. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Timothy Baldwin
  • 2. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Talk Outline 1 Introduction Bias Language Identification 2 Take 1: Starting out Naively 3 Take 2: Removing the Domain Bias 4 Take 3: Removing the Monolingual Bias 5 Take 4: Removing Social Bias 6 Conclusions
  • 3. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 What is Bias? Lots of different biases in NLP/ML, but biases of particular interest in this talk are:
  • 4. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 What is Bias? Lots of different biases in NLP/ML, but biases of particular interest in this talk are: domain bias = systematic statistical differences between “domains”, usually in the form of vocab, class priors and conditional probabilities
  • 5. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 What is Bias? Lots of different biases in NLP/ML, but biases of particular interest in this talk are: domain bias = systematic statistical differences between “domains”, usually in the form of vocab, class priors and conditional probabilities dataset bias = sampling bias in a given dataset, giving rise to (often unintentional) content bias
  • 6. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 What is Bias? Lots of different biases in NLP/ML, but biases of particular interest in this talk are: domain bias = systematic statistical differences between “domains”, usually in the form of vocab, class priors and conditional probabilities dataset bias = sampling bias in a given dataset, giving rise to (often unintentional) content bias model bias = bias in the output of a model wrt gold standard
  • 7. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 What is Bias? Lots of different biases in NLP/ML, but biases of particular interest in this talk are: domain bias = systematic statistical differences between “domains”, usually in the form of vocab, class priors and conditional probabilities dataset bias = sampling bias in a given dataset, giving rise to (often unintentional) content bias model bias = bias in the output of a model wrt gold standard social bias = model bias towards/against particular socio-demographic groups of users
  • 8. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Domain Bias: Why should we Care? Domain robustness: want to avoid the “Penn Treebank effect” of over-specialising to a particular domain and not being able to achieve the same performance levels over other input domains domain adaptation = sub-area of NLP/ML concerned with explicitly learning distribution differences between the “source” and “target” domains, and use this to modify the parameters of a model trained on the source domain, to better fit the data in the target domain [Daum´e III, 2007, Ben-David et al., 2010] generally based on an assumption of access to data in the target domain, usually in the form of a small labelled set of data
  • 9. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Dataset Bias: why should we Care? Representativeness: datasets are core to NLP/ML, in terms of setting challenges and benchmarking progress, but if those datasets aren’t representative of the true population, any success is hollow at best!
  • 10. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Dataset Bias: why should we Care? Representativeness: datasets are core to NLP/ML, in terms of setting challenges and benchmarking progress, but if those datasets aren’t representative of the true population, any success is hollow at best! our datasets are considerably more discriminative than naively thought [Torralba and Efros, 2011, Tommasi et al., 2015]
  • 11. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Dataset Bias: why should we Care? Representativeness: datasets are core to NLP/ML, in terms of setting challenges and benchmarking progress, but if those datasets aren’t representative of the true population, any success is hollow at best! our datasets are considerably more discriminative than naively thought [Torralba and Efros, 2011, Tommasi et al., 2015] our datasets are socially biased in all sorts of ways [Zhao et al., 2017]
  • 12. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Dataset Bias: why should we Care? Representativeness: datasets are core to NLP/ML, in terms of setting challenges and benchmarking progress, but if those datasets aren’t representative of the true population, any success is hollow at best! our datasets are considerably more discriminative than naively thought [Torralba and Efros, 2011, Tommasi et al., 2015] our datasets are socially biased in all sorts of ways [Zhao et al., 2017] Impact on model bias: dataset bias tends to be mirrored/amplified by the model [Zhao et al., 2017]
  • 13. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Social Bias: why should we Care? Social equity: do we really want to be in the business of building models that only cater to certain sub-populations or that overtly discriminate against certain sub-populations? [Hovy and Søgaard, 2015, Hovy and Spruit, 2016]
  • 14. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Social Bias: why should we Care? Social equity: do we really want to be in the business of building models that only cater to certain sub-populations or that overtly discriminate against certain sub-populations? [Hovy and Søgaard, 2015, Hovy and Spruit, 2016] Legal adherence: to be able to apply NLP/ML in many domains by public and private organisations, it has to be demonstrably adherent to anti-discrimination laws
  • 15. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Social Bias: why should we Care? Social equity: do we really want to be in the business of building models that only cater to certain sub-populations or that overtly discriminate against certain sub-populations? [Hovy and Søgaard, 2015, Hovy and Spruit, 2016] Legal adherence: to be able to apply NLP/ML in many domains by public and private organisations, it has to be demonstrably adherent to anti-discrimination laws Simple approaches such as “fairness through unawareness” and “demographic parity” do not solve the problem [Hardt et al., 2016]
  • 16. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 What is Language Identification (“LangID”)? ⾃然語⾔處理是⼈⼯智慧和語⾔學領域的 分⽀學科。在這此領域中探討如何處理及 運⽤⾃然語⾔;⾃然語⾔認知則是指讓電 腦「懂」⼈類的語⾔。⾃然語⾔⽣成系統 把計算機數據轉化為⾃然語⾔。⾃然語⾔ 理解系統把⾃然語⾔轉化為計算機程序更 易于處理的形式。 Natural Language processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages. In theory, natural-language processing is a very attractive method of human-computer interaction. L'Elaborazione del linguaggio naturale, detta anche NLP (dall'inglese Natural Language Processing), è il processo di estrazione di informazioni semantiche da espressioni del linguaggio umano o naturale, scritte o parlate, tramite l'elaborazione di un calcolatore elettronico. ⾃然⾔語処理は、⼈間が日常的に使ってい る⾃然⾔語をコンピュータに処理させる⼀ 連の技術であり、⼈⼯知能と⾔語学の⼀分 野である。計算⾔語学も同じ意味であるが、 前者は⼯学的な視点からの⾔語処理をさ すのに対して、後者は⾔語学的視点を重視 する⼿法をさす事が多い。データベース内の 情報を⾃然⾔語に変換したり、⾃然⾔語の ⽂章をより形式的な(コンピュータが理解し やすい)表現に変換するといった処理が含 まれる。
  • 17. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 How Hard can it be? What is the language of the following document: S˚a sitter du ˚ater p˚a handlar’ns trapp och gr˚ater s˚a ¨overgivet.
  • 18. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 How Hard can it be? What is the language of the following document: S˚a sitter du ˚ater p˚a handlar’ns trapp och gr˚ater s˚a ¨overgivet. Swedish
  • 19. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Getting Wilder ... What is the language of the following document: Revolution is à la mode at the moment in the country, where the joie de vivre of the citizens was once again plunged into chaos after a third coup d'état in as many years. Although the leading general is by no means an enfant terrible per se, the fledgling economy still stands to be jettisoned down la poubelle
  • 20. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Getting Wilder ... What is the language of the following document: Revolution is à la mode at the moment in the country, where the joie de vivre of the citizens was once again plunged into chaos after a third coup d'état in as many years. Although the leading general is by no means an enfant terrible per se, the fledgling economy still stands to be jettisoned down la poubelle English
  • 21. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 ... And Wilder Still ... What is the language of the following document: Jawaranya ngu yidanyi ngaba ngu yardi yaniya cool drink ninaka nanga alangi-nka.
  • 22. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 ... And Wilder Still ... What is the language of the following document: Jawaranya ngu yidanyi ngaba ngu yardi yaniya cool drink ninaka nanga alangi-nka. Wambaya
  • 23. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 ... And the Actual Wildness of the Problem What is the language of the following document: 111000111000001010000100111000111000000 110111011111000111000001110111100
  • 24. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 ... And the Actual Wildness of the Problem What is the language of the following document: 111000111000001010000100111000111000000 110111011111000111000001110111100 Too hard? A hint ... 11100011 10000010 10000100 11100011 10000001 10111011 11100011 10000011 10111100
  • 25. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 ... And the Actual Wildness of the Problem What is the language of the following document: 111000111000001010000100111000111000000 110111011111000111000001110111100 Too hard? A hint ... 11100011 10000010 10000100 11100011 10000001 10111011 11100011 10000011 10111100 Japanese
  • 26. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Talk Outline 1 Introduction Bias Language Identification 2 Take 1: Starting out Naively 3 Take 2: Removing the Domain Bias 4 Take 3: Removing the Monolingual Bias 5 Take 4: Removing Social Bias 6 Conclusions
  • 27. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Preliminary LangID Setting To get going, let’s assume that: 1 documents are all from a homogeneous source = closed-world domain assumption 2 all documents are monolingual = single label assumption 3 languages are homogeneous = single variety assumption
  • 28. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Preliminary LangID Setting To get going, let’s assume that: 1 documents are all from a homogeneous source = closed-world domain assumption 2 all documents are monolingual = single label assumption 3 languages are homogeneous = single variety assumption Here and throughout, we’ll model the task as a supervised classification problem
  • 29. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Preliminary LangID Setting To get going, let’s assume that: 1 documents are all from a homogeneous source = closed-world domain assumption 2 all documents are monolingual = single label assumption 3 languages are homogeneous = single variety assumption Here and throughout, we’ll model the task as a supervised classification problem To get us going, let’s play around with simple multinomial naive Bayes over byte n-grams
  • 30. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Machine Learning in LangID Supervised classification task Model of labeled training data used to label “unseen” data Similar to text classification (“TC”) vector space models, naive Bayes models, logistic regression Text representation: TC: bag of words LangID: bag of byte-sequences Source(s): Hughes et al. [2006]
  • 31. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Text Representation for LangID Input: language identification 1-grams: l, a, n, g, u, a ... 2-grams: la, an, ng, gu, ua, ag ... 3-grams: lan, ang, ngu, gua, uag, age ... 4-grams: lang, angu, ngua, guag, uage, age ... ...
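To make the byte n-gram naive Bayes setup above concrete, here is a minimal sketch using scikit-learn. The training texts, labels and the 1–4-gram range are illustrative placeholders, not the configuration used in the experiments that follow.

```python
# Minimal sketch: byte n-gram LangID with multinomial naive Bayes
# (toy data; not the experimental configuration reported in this talk).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "natural language processing",           # English
    "elaborazione del linguaggio naturale",  # Italian
    "traitement automatique des langues",    # French
]
train_langs = ["en", "it", "fr"]

def to_byte_string(text):
    # Decode UTF-8 bytes with latin-1 so every byte maps to exactly one
    # character, giving byte-level n-grams regardless of script.
    return text.encode("utf-8").decode("latin-1")

model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 4), lowercase=False),
    MultinomialNB(),
)
model.fit([to_byte_string(t) for t in train_texts], train_langs)

print(model.predict([to_byte_string("il trattamento del linguaggio")]))
```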
  • 32. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Datasets for Preliminary Experiments
  Dataset      Docs    Langs   Doc Length (bytes)
  JRC-Acquis   20000   22      18478.5±60836.8
  Debian       21735   89      12329.8±30902.7
  RCV2         20000   13      3382.7±1671.8
  Wikipedia    20000   68      7531.3±16522.2
  Source(s): Steinberger et al. [2006], Lui and Baldwin [2011]
  • 33. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 In-domain Results [Bar chart: in-domain accuracy (%): JRC-Acquis 99, Debian 96, RCV2 97, Wikipedia 94]
  • 34. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 In-domain vs. Cross-domain Results [Bar chart: accuracy (%): JRC-Acquis 99 in-domain vs. 93 cross-domain, Debian 96 vs. 82, RCV2 97 vs. 58, Wikipedia 94 vs. 74]
  • 35. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 What’s Happening? Transfer Learning
  • 36. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Finding In-domain, supervised language identification (over longish, monolingual documents) is relatively trivial for even basic learning algorithms, but overfitting tends to be chronic Source(s): Lui and Baldwin [2011, 2012]
  • 37. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Talk Outline 1 Introduction Bias Language Identification 2 Take 1: Starting out Naively 3 Take 2: Removing the Domain Bias 4 Take 3: Removing the Monolingual Bias 5 Take 4: Removing Social Bias 6 Conclusions
  • 38. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Cross-domain Feature Selection Goal: identify n-grams with: high language association low domain association Measure relationship using information gain (= difference in entropy after partitioning the data in a given way)
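As a rough illustration of the information gain computation (not the exact formulation of Lui and Baldwin [2011]), the sketch below scores a single binary feature (presence of an n-gram) against a labelling; the document labels and presence indicators are placeholders.

```python
# Sketch: information gain of a binary feature (presence of an n-gram t)
# with respect to a labelling.  Run once with language labels and once
# with domain labels to get IG_lang(t) and IG_domain(t).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(present, labels):
    """present[i] is True iff the n-gram occurs in document i."""
    n = len(labels)
    with_t = [l for p, l in zip(present, labels) if p]
    without_t = [l for p, l in zip(present, labels) if not p]
    conditional = (len(with_t) / n) * entropy(with_t) \
                + (len(without_t) / n) * entropy(without_t)
    return entropy(labels) - conditional
```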
  • 39. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Language- vs. Domain-based IG [Scatter plot: per-feature IG with respect to language (x-axis) vs. IG with respect to domain (y-axis), coloured by log10(N)]
  • 40. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 What does this Mean? There are two distinct groups of features: (1) those whose IG for language is strongly correlated with their IG for domain; and (2) those whose IG for language is largely independent of their IG for domain; the second group is what we are interested in Automatically detecting language- (and not domain-) associated features: LD_all(t) = IG_lang^all(t) − IG_domain(t) Also consider a binarised, per-language variant that corrects for differences in training data volume across languages: LD_bin(t|l) = IG_lang^bin(t|l) − IG_domain(t) Source(s): Lui and Baldwin [2011]
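Combining the two IG terms gives the LD scores. A minimal sketch, building on the information_gain helper sketched above; the per-language binarisation for LD_bin follows the slide, and everything else (label lists, presence indicators) is a placeholder.

```python
# Sketch: LD scores as the difference between language IG and domain IG.
def ld_all(present, langs, domains):
    return information_gain(present, langs) - information_gain(present, domains)

def ld_bin(present, langs, domains, target_lang):
    # Binarised, per-language view: is each document in target_lang or not?
    binarised = [lang == target_lang for lang in langs]
    return information_gain(present, binarised) - information_gain(present, domains)
```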
  • 41. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Computational Bottleneck Calculating the IG for low-order n-grams is fine, but it quickly becomes intractable for larger values of n Ideally, we want a method which scales to (very) large numbers of features/high n-gram orders, with little computational overhead
  • 42. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 DF vs. LD [Scatter plot: per-feature document frequency (x-axis) vs. LD score (y-axis), coloured by log10(N)]
  • 43. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Observation Low DF is a good predictor of low LD (but not vice versa) As DF is much cheaper to compute, we first identify the 15000 features with the highest DF for a given n-gram order, and assign an LD score of 0 to all features outside this set The final feature representation is a combination of the top-N features for each of a predefined set of n-gram orders
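A rough sketch of the DF shortlist just described; the shortlist size and data structures are placeholders rather than the exact pipeline behind langid.py.

```python
# Sketch: DF-based shortlisting before LD scoring, per n-gram order.
from collections import Counter

def df_shortlist(doc_ngram_sets, k=15000):
    """doc_ngram_sets: one set of n-grams (of a single order) per document.
    Returns the k n-grams with the highest document frequency."""
    df = Counter()
    for ngrams in doc_ngram_sets:
        df.update(ngrams)  # sets, so each n-gram counts once per document
    return {t for t, _ in df.most_common(k)}

def score_features(candidates, shortlist, ld_score):
    # Features outside the DF shortlist are assigned an LD score of 0.
    return {t: (ld_score(t) if t in shortlist else 0.0) for t in candidates}
```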
  • 44. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Other Details Byte-based tokenisation Multinomial naive Bayes learner 1–9-grams (although little gained in going beyond 4-grams) Four training/test datasets, plus three test-only datasets (N-EuroGOV, N-TCL and Wikipedia) Source(s): Baldwin and Lui [2010], Lui and Baldwin [2011, 2012]
  • 45. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 In-domain vs. Cross-domain Results [Bar chart: accuracy (%) per dataset (JRC-Acquis / Debian / RCV2 / Wikipedia), by feature set: In-domain 99/96/97/94, IG_lang^bin 93/82/58/74, LD_all 97/87/71/67, LD_bin 99/94/99/92]
  • 46. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Per-language Accuracy as |L| Increases [Scatter plot: per-language accuracy over the domain's languages (x-axis) vs. over all languages (y-axis), for TCL, Wikipedia and EuroGOV]
  • 47. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Findings Assuming we have a reasonable number of different domains, LD_bin is an effective approach to feature selection, and when paired with DF ranking it scales to very large feature sets and large numbers of languages Adding more training languages tends not to hurt performance, and in some instances actually helps Reference implementation: https://github.com/saffsd/langid.py
  • 48. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Talk Outline 1 Introduction Bias Language Identification 2 Take 1: Starting out Naively 3 Take 2: Removing the Domain Bias 4 Take 3: Removing the Monolingual Bias 5 Take 4: Removing Social Bias 6 Conclusions
  • 49. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 What about Multilingual Documents? I
  • 50. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 What about Multilingual Documents? II
  Reasons for documents being multilingual: code-switching; translations; boilerplate/interface language; multiple users
  Why detect multilingual documents? pre-filtering for monolingual NLP; mining bilingual texts for MT; low-density languages on the web
  Task setting in this section: estimate the relative language proportions for a given mono-/multi-lingual document
  Source(s): Lui et al. [2014]
  • 51. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Generative Mixture Models P(w_i) = Σ_{j=1}^{T} P(w_i | z_i = j) P(z_i = j), where P(w_i): probability of the ith token; P(w_i | z_i = j): probability of w_i given label j; P(z_i = j): probability of the ith label being j
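A tiny numeric illustration of the mixture: each token's probability is the label-weighted sum of its per-label probabilities. All values below are made up for illustration.

```python
# Tiny numeric illustration of P(w_i) = sum_j P(w_i | z_i = j) P(z_i = j).
import numpy as np

phi = np.array([
    [0.7, 0.2, 0.1],   # P(token | language 0)  -- made-up values
    [0.1, 0.3, 0.6],   # P(token | language 1)
])
theta = np.array([0.25, 0.75])   # P(z = j): document-level language mix

p_tokens = theta @ phi           # P(w_i) for each of the 3 token types
print(p_tokens)                  # [0.25, 0.275, 0.475]
```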
  • 52. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Latent Dirichlet Allocation (LDA) Source(s): Blei et al. [2003]
  • 53. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Gibbs Sampling: LDA vs. Multilingual LangID
                                      LDA                          Multilingual LangID
  labels z                            topics                       languages
  tokens w                            words                        language-indicative byte sequences
  token–label distribution φ          infer using Gibbs sampler    estimate using training data
  label–document distribution θ       infer using Gibbs sampler
  • 54. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 How many Languages? Use Gibbs sampling to infer the most likely mixture of languages over a fixed set: L is the set of languages in the training data; find the most likely subset of languages λ ⊂ L Source(s): Lui et al. [2014]
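A minimal collapsed Gibbs sampler in this spirit, with φ held fixed (estimated from training data) and the label–document distribution θ inferred per document. The Dirichlet hyperparameter, iteration count and toy inputs are assumptions for illustration, not the settings of the reference implementation.

```python
# Sketch: collapsed Gibbs sampling of a document's language mix, with the
# per-language token distributions phi held fixed.
import numpy as np

def sample_language_mix(tokens, phi, alpha=0.1, n_iter=200, seed=0):
    """tokens: array of token (byte n-gram) ids for one document.
    phi: (n_langs, vocab) matrix of P(token | language).
    Returns the estimated language proportions theta."""
    rng = np.random.default_rng(seed)
    n_langs = phi.shape[0]
    z = rng.integers(n_langs, size=len(tokens))        # random initial labels
    counts = np.bincount(z, minlength=n_langs).astype(float)

    for _ in range(n_iter):
        for i, w in enumerate(tokens):
            counts[z[i]] -= 1                          # leave token i out
            p = phi[:, w] * (counts + alpha)           # P(z_i = l | rest)
            p /= p.sum()
            z[i] = rng.choice(n_langs, p=p)
            counts[z[i]] += 1

    return (counts + alpha) / (counts.sum() + n_langs * alpha)

# Toy usage: two "languages", three token types (all values made up).
phi = np.array([[0.80, 0.15, 0.05],
                [0.05, 0.15, 0.80]])
doc = np.array([0, 0, 2, 2, 2, 1])
print(sample_language_mix(doc, phi))   # roughly a 2:3-ish split, plus smoothing
```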
  • 56. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 WikipediaMulti dataset
  Актор народився Німеччини. Мати Кларка страждала від епілепсії і померла через десять місяців після його народження, тому хлопчика виростили батько і мачуха, Джені Данлап. У віці 17 років Гейбл зацікавився театром. Але тільки у віці 21 р. він зміг почати працювати у другорозрядних театрах. У Портланді, Орегон Гейбл познайомився з театральним менеджером Джозефіною Діллон, яка дуже допомогла Гейблу у навчанні театральному мистецтву.Türkiye, Kafkasya ve Orta Asya bölgelerini kapsayan akademik ve hakemli dergisi. Disiplinlerarası bir yayın politikası belirleyen OAKA’ya bilimsel ve orijinal makaleler gönderilebiliyor. Yılda 2sayı yayınlanan dergi Bahar ve Güz dönemlerinde çıkıyor. Dergi Türkçe, Türkçe lehçeleri ve İngilizce makaleleri kabul ediyor. Ayrıca başka dillerden çeviri de gönderilebiliyor.Машина на Тюринг е абстрактно изчислително устройство, описано от английския математик Алън Тюринг през 1936 г. Тюринг използва машината, за да даде първото точно определение на понятието компютърните науки, най-вече в областите изчислимост и сложност на алгоритмите, както и в математическата логика. Имената на инструкциите са големите латински букви A, B и C. С малка буква s сме означили специалната инструкция за спиране на изчислението.
  Synthetic dataset: mixture of monolingual and multilingual documents; K languages per document (1≤K≤5); samples lines from Wikipedia; train: 5000 monolingual documents; dev: 1000 documents for each K; test: 200 documents for each K
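A rough sketch of how such a synthetic dataset can be assembled from monolingual line pools (a stand-in for the Wikipedia sampling above); the pool contents, K range and lines-per-language value are placeholders, not the exact WikipediaMulti construction.

```python
# Sketch: building a synthetic multilingual document by mixing lines drawn
# from monolingual pools (placeholder stand-in for Wikipedia line sampling).
import random

def make_multilingual_doc(pools, k, lines_per_lang=3, seed=None):
    """pools: dict mapping language code -> list of monolingual lines
    (each pool must have at least lines_per_lang lines).
    k: number of languages to mix into the document (e.g. 1 <= k <= 5)."""
    rng = random.Random(seed)
    langs = rng.sample(sorted(pools), k)       # pick k languages
    lines = []
    for lang in langs:
        lines.extend(rng.sample(pools[lang], lines_per_lang))
    rng.shuffle(lines)                         # interleave the languages
    return langs, "\n".join(lines)
```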
  • 57. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 WikipediaMulti results [Bar chart: precision, recall and F-score for Our Method, Linguini and SegLang]
  • 58. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Findings Extension of langid.py to include prediction of the language mix in a document (and a Gibbs sampler to predict the language proportions), which outperforms benchmarks on synthetic and real-world data Reference implementation: https://github.com/saffsd/polyglot Source(s): Lui et al. [2014]
  • 59. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Talk Outline 1 Introduction Bias Language Identification 2 Take 1: Starting out Naively 3 Take 2: Removing the Domain Bias 4 Take 3: Removing the Monolingual Bias 5 Take 4: Removing Social Bias 6 Conclusions
  • 60. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Are all Varieties of the Same Language Equal? All good so far, but we are still assuming that languages are homogeneous; is this really the case? Explore the question by constructing more diverse test datasets from Twitter across the dimensions of: 1 geographic diversity 2 multilingual diversity Also generate more socially diverse training datasets Source(s): Jurgens et al. [2017]
  • 61. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 EquiLID: Socially-equitable LangID Attempt to make LangID more socially equitable, with a particular focus on Twitter Model details: character-based tokenisation; encoder-decoder architecture (3 layers of 512 GRUs) with attention; trained over 3 epochs using Adadelta Training setup: train on geographically + socially/topically diverse (+ multilingual) data; balance training data across languages (sketched below)
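A minimal sketch of the per-language balancing step; the cap value and data layout are assumptions for illustration, not the EquiLID training pipeline.

```python
# Sketch: balancing training data across languages by capping (and
# shuffling) the number of examples per language.  The cap is a placeholder.
import random
from collections import defaultdict

def balance_by_language(examples, cap=10000, seed=0):
    """examples: list of (text, language) pairs."""
    rng = random.Random(seed)
    by_lang = defaultdict(list)
    for text, lang in examples:
        by_lang[lang].append(text)
    balanced = []
    for lang, texts in by_lang.items():
        rng.shuffle(texts)
        balanced.extend((t, lang) for t in texts[:cap])   # at most `cap` per language
    rng.shuffle(balanced)
    return balanced
```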
  • 62. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Results (Monolingual) [Bar chart: F-score (%) for langid.py / Chrome CLD / EquiLID: Geo_µ 96/93/98, Geo_M 23/22/60, Twitter70_µ 77/74/91, Twitter70_M 38/50/92] Source(s): Jurgens et al. [2017]
  • 63. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 LangID Recall in Application Scenario Source(s): Jurgens et al. [2017]
  • 64. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Reflections Explicitly training models on diverse, balanced data is critically important in order to achieve “social equity” Need to tease apart training scenario from modelling details (langid.py not retrained, e.g.) Possible extra gains to be had through explicit language-, geographical- and social-level regularisation during training Deep learning model clearly more accurate, but efficiency a concern (but see Botha et al. [2017], e.g.)
  • 65. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Talk Outline 1 Introduction Bias Language Identification 2 Take 1: Starting out Naively 3 Take 2: Removing the Domain Bias 4 Take 3: Removing the Monolingual Bias 5 Take 4: Removing Social Bias 6 Conclusions
  • 66. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Conclusions Domain and social bias prevalent in standard datasets and tools Domain bias well known, but tends to rely on explicit domain adaptation (not always realistic); “generalisation” harder Social bias is an emerging area of NLP that is much less understood, but incredibly important as NLP reaches more and more people in more and more (social) contexts Possible to reduce the bias through explicit training regimes
  • 67. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 Acknowledgements The following collaborators contributed to this work: Steven Bird, Baden Hughes, Jey Han Lau, Marco Lui, Andrew MacKinlay and Jeremy Nicholson This work was supported by the Australian Research Council and NICTA
  • 68. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 References I Timothy Baldwin and Marco Lui. Language identification: The long and the short of the matter. In Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), pages 229–237, Los Angeles, USA, 2010. Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1):151–175, 2010. David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003. Jan A. Botha, Emily Pitler, Ji Ma, Anton Bakalov, Alex Salcianu, David Weiss, Ryan McDonald, and Slav Petrov. Natural language processing with small feed-forward networks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pages 2869–2875, 2017. Hal Daumé III. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), Prague, Czech Republic, 2007. Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems 29, pages 3315–3323, 2016. URL http://papers.nips.cc/paper/6374-equality-of-opportunity-in-supervised-learning.pdf.
  • 69. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 References II Dirk Hovy and Anders Søgaard. Tagging performance correlates with author age. In Proceedings of the Joint conference of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015), pages 483–488, 2015. URL http://aclweb.org/anthology/P15-2079. Dirk Hovy and Shannon L. Spruit. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pages 591–598, 2016. Baden Hughes, Timothy Baldwin, Steven Bird, Jeremy Nicholson, and Andrew MacKinlay. Reconsidering language identification for written language resources. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pages 485–488, Genoa, Italy, 2006. David Jurgens, Yulia Tsvetkov, and Dan Jurafsky. Incorporating dialectal variability for socially equitable language identification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Volume 2: Short Papers, pages 51–57, 2017. Marco Lui and Timothy Baldwin. Cross-domain feature selection for language identification. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), pages 553–561, Chiang Mai, Thailand, 2011. Marco Lui and Timothy Baldwin. langid.py: An off-the-shelf language identification tool. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012) Demo Session, pages 25–30, Jeju, Republic of Korea, 2012.
  • 70. Domain and Social Bias in Natural Language Processing: A Case Study in Language Identification Grammarly 13/9/2017 References III Marco Lui, Jey Han Lau, and Timothy Baldwin. Automatic detection and language identification of multilingual documents. Transactions of the Association for Computational Linguistics, 2(Feb):27–40, 2014. Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufis, and Dániel Varga. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, 2006. Tatiana Tommasi, Novi Patricia, Barbara Caputo, and Tinne Tuytelaars. A Deeper Look at Dataset Bias. ArXiv e-prints, 2015. Antonio Torralba and Alexei A. Efros. Unbiased look at dataset bias. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), pages 1521–1528, 2011. Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pages 2941–2951, 2017.